Abstract

We consider unsupervised estimation of mixtures of discrete graphical models, where the class 
variable corresponding to the mixture components is hidden and each mixture component over 
the observed variables can have a potentially different Markov graph structure and parameters. 
We propose a novel approach for estimating the mixture components, and our output is a 
tree-mixture model which serves as a good approximation to the underlying graphical model 
mixture. Our method is efficient when the union graph, which is the union of the Markov 
graphs of the mixture components, has sparse vertex separators between any pair of observed 
variables. This includes tree mixtures and mixtures of bounded degree graphs. For such models, 
we prove that our method correctly recovers the union graph structure and the tree structures 
corresponding to maximum-likelihood tree approximations of the mixture components. The 
sample and computational complexities of our method scale as poly(p, r), for an r-component 
mixture of p-variate graphical models. We further extend our results to the case when the union 
graph has sparse local separators between any pair of observed variables, such as mixtures of 
locally tree-like graphs, and the mixture components are in the regime of correlation decay. 

Keywords: Graphical models, mixture models, spectral methods, tree approximation. 
1 Introduction

Graphical models offer a graph-based framework for representing multivariate distributions, where
the structural and qualitative relationships between the variables are represented via a graph
structure, while the parametric and quantitative relationships are represented via values assigned
to different groups of nodes on the graph (Lauritzen, 1996). Such a decoupling is natural in a
variety of contexts, including computer vision, financial modeling, and phylogenetics. Moreover,
graphical models are amenable to efficient inference via distributed algorithms such as belief
propagation (Wainwright and Jordan, 2008). Recent innovations have made it feasible to train these
models with low computational and sample requirements in high dimensions (see Section 1.2 for a
brief overview).

Simultaneously, much progress has been made in analyzing mixture models (Lindsay, 1995). A
mixture model can be thought of as selecting the distribution of the manifest variables from a fixed
set, depending on the realization of a so-called choice variable, which is latent or hidden. Mixture
models have widespread applicability since they can account for changes in observed data based on
hidden influences. Recent works have provided provable guarantees for learning high-dimensional
mixtures under a variety of settings (see Section 1.2).

In this paper, we consider mixtures of (undirected) graphical models, which combine the modeling
power of the above two formulations. These models can incorporate context-specific dependencies,
where the structural (and parametric) relationships among the observed variables can
change depending on a hidden context. These models allow for parsimonious representation of
high-dimensional data, while retaining the computational advantage of performing inference via
belief propagation and its variants. The current practice for learning mixtures of graphical models
(and other mixture models) is based on local-search heuristics such as expectation maximization (EM).
However, EM scales poorly in the number of dimensions, suffers from convergence issues, and lacks
theoretical guarantees.

In this paper, we propose a novel approach for learning graphical model mixtures, which offers 
a powerful alternative to EM. At the same time, we establish theoretical guarantees for our method 
for a wide class of models, which includes tree mixtures and mixtures over bounded degree graphs. 
Previous theoretical guarantees are mostly limited to mixtures of product distributions (see
Section 1.2). These models are restrictive since they posit that the manifest variables are related to
one another only via the latent choice variable, and have no direct dependence otherwise. Our work 
is a significant generalization of these models, and incorporates models such as tree mixtures and 
mixtures over bounded degree graphs. 

Our approach aims to approximate the underlying graphical model mixture with a tree-mixture
model. In our view, a tree-mixture approximation offers a good tradeoff between data fitting and
inferential complexity of the model. Tree mixtures are attractive since inference reduces to belief
propagation on the component trees (Meila and Jordan, 2001). Tree mixtures thus present a middle
ground between tree graphical models, which are too simplistic, and general graphical model
mixtures, where inference is not tractable. Our goal is to efficiently fit the observed data to a
tree-mixture model.

1.1 Summary of Results 

We propose a novel method for learning discrete graphical mixture models. It combines the tech- 
niques used in graphical model selection based on conditional independence tests, and the spectral 
decomposition methods employed for estimating the parameters of mixtures of product distribu- 
tions. Our method proceeds in three main stages: graph structure estimation, parameter estima- 
tion, and tree approximation. 

In the first stage, our algorithm estimates the union graph structure, corresponding to the union
of the Markov graphs of the mixture components. We propose a rank criterion for classifying a node
pair as neighbors or non-neighbors in the union graph of the mixture model, which can be viewed as a
generalization of conditional-independence tests for graphical model selection (Anandkumar et al.,
2012c; Bresler et al., 2008; Spirtes and Meek, 1995). Our method is efficient when the union graph
has sparse separators between any node pair, which holds for tree mixtures and mixtures of bounded
degree graphs. The sample complexity of our algorithm is logarithmic in the number of nodes. Thus,
our method learns the union graph structure of a graphical model mixture with similar guarantees
as graphical model selection (i.e., when there is a single graphical model).

In the second stage, we use the union graph estimate $\widehat{G}_\cup$ to learn the pairwise marginals of
the mixture components. Since the choice variable is hidden, this involves decomposition of the
observed statistics into component models. We leverage the spectral decomposition method
developed for learning mixtures of product distributions (Anandkumar et al., 2012a; Chang, 1996;
Mossel and Roch, 2006). In a mixture of product distributions, the observed variables are
conditionally independent given the hidden class variable. We adapt this method to our setting as
follows: we consider different triplets over the observed nodes and condition on suitable separator
sets (in the union graph estimate $\widehat{G}_\cup$) to obtain a series of mixtures of product distributions. Thus,
we obtain estimates for pairwise marginals of each mixture component (and in principle, higher
order moments) under some natural non-degeneracy conditions.

In the final stage, we find the best tree approximation to the estimated component marginals via
the standard Chow-Liu algorithm (Chow and Liu, 1968). The Chow-Liu algorithm produces a max-weight
spanning tree using the estimated pairwise mutual information as edge weights. We establish that
our algorithm recovers the correct tree structure corresponding to the maximum-likelihood tree
approximation of each mixture component. In the special case when the underlying distribution is a
tree mixture, this implies that we can recover tree structures corresponding to all the mixture
components. The computational and sample complexities of our method scale as poly(p, r), where p is
the number of nodes and r is the number of mixture components.
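The tree-approximation step can be sketched in a few lines. The following is a minimal illustration, assuming the pairwise marginals of one mixture component are already available as small joint-probability tables; the function names `mutual_information` and `chow_liu_tree` and the dictionary input format are hypothetical choices made for this sketch, not part of the method's statement.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a 2-D joint pmf given as a d x d array."""
    joint = np.asarray(joint, dtype=float)
    pu = joint.sum(axis=1, keepdims=True)   # marginal of the row variable
    pv = joint.sum(axis=0, keepdims=True)   # marginal of the column variable
    mask = joint > 0                        # skip zero cells (0 log 0 = 0)
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pu @ pv)[mask])))

def chow_liu_tree(pairwise):
    """Max-weight spanning tree (Prim's algorithm) with mutual-information weights.

    pairwise[(u, v)] is the joint pmf of nodes u < v; nodes are 0..p-1
    (an assumed input format). Returns a sorted list of edges (u, v), u < v.
    """
    p = 1 + max(max(e) for e in pairwise)
    w = np.zeros((p, p))
    for (u, v), joint in pairwise.items():
        w[u, v] = w[v, u] = mutual_information(joint)
    in_tree, edges = {0}, []
    while len(in_tree) < p:
        # pick the heaviest edge leaving the current tree
        u, v = max(((a, b) for a in in_tree for b in range(p) if b not in in_tree),
                   key=lambda e: w[e])
        edges.append((min(u, v), max(u, v)))
        in_tree.add(v)
    return sorted(edges)
```

In the mixture setting, such a routine would be run once per component on that component's estimated pairwise marginals.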

Recall that the success of our method relies on the presence of sparse vertex separators between
node pairs in the union graph, i.e., the union of Markov graphs of the mixture components. This
includes tree mixtures and mixtures of bounded degree graphs. We extend our methods and analysis
to a larger family of models, where the union graph has sparse local separators (Anandkumar et al.,
2012c), which is a weaker criterion. This family includes locally tree-like graphs (including sparse
random graphs), and augmented graphs (e.g. small-world graphs where there is a local and a global
graph). The criterion of sparse local separation significantly widens the scope, and we prove that
our methods succeed in these models when the mixture components are in the regime of correlation
decay (Anandkumar et al., 2012c). The sample and computational complexities are significantly
improved for this class, since they depend only on the size of local separators (while previously
they depended on the size of exact separators).

Our proof techniques involve establishing the correctness of our algorithm (under exact statis- 
tics). The sample analysis involves careful use of spectral perturbation bounds to guarantee success 
in finding the mixture components. In addition, for the setting with sparse local separators, we 
incorporate the correlation decay rate functions of the component models to quantify the additional 
distortion introduced due to the use of local separators as opposed to exact separators. One caveat 
of our method is that we require the dimension of the node variables d to be larger than the number 
of mixture components r. In principle, this limitation can be overcome if we consider larger (fixed) 
groups of nodes and implement our method. Another limitation is that we require full rank views 
of the latent factor for our method to succeed. However, this is also a requirement imposed for 
learning mixtures of product distributions. Moreover, it is known that learning singular models, 
i.e., those which do not satisfy the above rank condition, is at least as hard as learning parity with
noise, which is conjectured to be computationally hard (Mossel and Roch, 2006). Another restriction
is that we require the presence of an observed variable, which is conditionally independent
of all the other variables, given the latent choice variable. However, note that this is significantly
weaker than the case of product mixture models, where all the observed variables are required to
be conditionally independent given the latent factor. To the best of our knowledge, our work is
the first to provide provable guarantees for learning non-trivial graphical mixture models (which
are not mixtures of product distributions), and we believe that it significantly advances the scope,
both on theoretical and practical fronts.

1.2 Related Work 

Our work lies at the intersection of learning mixture models and graphical model selection. We 
outline related works in both these areas. 



Overview of Mixture Models: Mixture models have been extensively studied (Lindsay, 1995)
and are employed in a variety of applications. More recently, the focus has been on learning
mixture models in high dimensions. There are a number of recent works dealing with estimation of
high-dimensional Gaussian mixtures, starting from the work of Dasgupta (1999) for learning
well-separated components, and most recently by Belkin and Sinha (2010) and Moitra and Valiant (2010),
in a long line of works. These works provide guarantees on recovery under various separation
constraints between the mixture components and/or have computational and sample complexities
growing exponentially in the number of mixture components r. In contrast, the so-called spectral
methods have both computational and sample complexities scaling only polynomially in the number
of components, and do not impose stringent separation constraints. We outline them below.



Spectral Methods for Mixtures of Product Distributions: The classical mixture model
over product distributions consists of multivariate distributions with a single latent variable H,
and the observed variables are conditionally independent under each state of the latent variable
(Lazarsfeld and Henry, 1968). Hierarchical latent class (HLC) models (Chen et al., 2008;
Zhang, 2004; Zhang and Kocka, 2004) generalize these models by allowing for multiple latent
variables. Spectral methods were first employed for learning discrete (hierarchical) mixtures of product
distributions (Chang, 1996; Hsu et al., 2009; Mossel and Roch, 2006) and have been recently
extended for learning general multiview mixtures (Anandkumar et al., 2012a). The method is based
on triplet and pairwise statistics of observed variables and we build on these methods in our work.
Note that our setting is not a mixture of product distributions, and thus, these methods are not
directly applicable.



Graphical Model Selection: Graphical model selection is a well studied problem starting
from the seminal work of Chow and Liu (1968) for finding the best tree approximation of a
graphical model. They established that maximum likelihood estimation reduces to a maximum weight
spanning tree problem where the edge weights are given by empirical mutual information.
However, the problem becomes more challenging when either some of the nodes are hidden (i.e., latent
tree models) or we are interested in estimating loopy graphs. Learning the structure of latent tree
models has been studied extensively, mainly in the context of phylogenetics (Durbin et al., 1999).
Efficient algorithms with provable performance guarantees are available, e.g. (Anandkumar et al.,
2011; Choi et al., 2011; Daskalakis et al., 2006; Erdos et al., 1999). Works on high-dimensional
loopy graphical model selection are more recent. The approaches can be classified into mainly
two groups: non-convex local approaches (Anandkumar and Valluvan, 2012; Anandkumar et al.,
2012c; Bresler et al., 2008; Jalali et al., 2011; Netrapalli et al., 2010) and those based
on convex optimization (Chandrasekaran et al., 2010; Meinshausen and Bühlmann, 2006;
Ravikumar et al., 2008, 2011). There is also some recent work on learning conditional models,
e.g. (Guo et al., 2011). However, these works are not directly applicable for learning mixtures
of graphical models.



Mixtures of Graphical Models: Works on learning mixtures of graphical models (other
than mixtures of product distributions) are fewer, and mostly focus on tree mixtures. The works
of Meila and Jordan (2001) and Kumar and Koller (2009) consider EM-based approaches for learning
tree mixtures. Thiesson et al. (1999) extend the approach to learn mixtures of graphical models
on directed acyclic graphs (DAG), termed as Bayesian multinets by Geiger and Heckerman
(1996), using the Cheeseman-Stutz asymptotic approximation, and Armstrong et al. (2009) consider
a Bayesian approach by assigning a prior to decomposable graphs. However, these approaches do
not have any theoretical guarantees.

Recently, Mossel and Roch (2011) consider structure learning of latent tree mixtures and provide
conditions under which they can be successfully recovered. Note that this model can be thought
of as a hierarchical mixture of product distributions, where the hierarchy changes according to the
realization of the choice variable. Our setting differs substantially from this work. Mossel and Roch
(2011) require that the component latent trees of the mixture be very different, in order for the
quartet tests to distinguish them (roughly), and establish that a uniform selection of trees will ensure
this condition. On the other hand, we impose no such restriction and allow for graphs of different
components to be same/different (although our algorithm is efficient when the overlap between the
component graphs is more). Moreover, we allow for loopy graphs while Mossel and Roch (2011)
restrict to learning latent tree mixtures. However, Mossel and Roch (2011) do allow for latent
variables on the tree, while we assume all variables to be observed (except for the latent
choice variable). Mossel and Roch (2011) consider only structure learning, while we consider both
structure and parameter estimation. Mossel and Roch (2011) limit to a finite number of mixture
components r = O(1), while we allow for r to scale with the number of variables p. As such, the two
methods operate in significantly different settings.



2 System Model 

2.1 Graphical Models 

We first introduce the concept of a graphical model and then discuss mixture models. A graphical
model is a family of multivariate distributions Markov on a given undirected graph (Lauritzen,
1996). In a discrete graphical model, each node $v \in V$ in the graph is associated with a random
variable $Y_v$ taking value in a finite set $\mathcal{Y}$ and let $d := |\mathcal{Y}|$ denote the cardinality of the set. The set
of edges^1 $E \subset \binom{V}{2}$ captures the set of conditional-independence relationships among the random
variables. We say that a vector of random variables $\mathbf{Y} := (Y_1, \ldots, Y_p)$ with a joint probability mass
function (pmf) $P$ is Markov on the graph $G$ if the local Markov property



$$P(y_v \mid \mathbf{y}_{\mathcal{N}(v)}) = P(y_v \mid \mathbf{y}_{V \setminus v}) \tag{1}$$



holds for all nodes $v \in V$, where $\mathcal{N}(v)$ denotes the open neighborhood of $v$ (i.e., not including $v$).
More generally, we say that $P$ satisfies the global Markov property if, for all disjoint sets $A, B \subset V$,

$$P(\mathbf{y}_A, \mathbf{y}_B \mid \mathbf{y}_{S(A,B;G)}) = P(\mathbf{y}_A \mid \mathbf{y}_{S(A,B;G)}) \, P(\mathbf{y}_B \mid \mathbf{y}_{S(A,B;G)}), \quad \forall A, B \subset V : \mathcal{N}[A] \cap \mathcal{N}[B] = \emptyset, \tag{2}$$

where the set $S(A,B;G)$ is a node separator^2 between $A$ and $B$, and $\mathcal{N}[A]$ denotes the closed
neighborhood of $A$ (i.e., including $A$). The global and local Markov properties are equivalent under
the positivity condition, given by $P(\mathbf{y}) > 0$ for all $\mathbf{y} \in \mathcal{Y}^p$ (Lauritzen, 1996). Henceforth, we say
that a graphical model satisfies Markov property with respect to a graph, if it satisfies the global
Markov property.

^1 We use notations $E$ and $G$ interchangeably to denote the set of edges.

^2 A set $S(A,B;G) \subset V$ is a separator of sets $A$ and $B$ if the removal of nodes in $S(A,B;G)$ separates $A$ and $B$
into distinct components.

The Hammersley-Clifford theorem (Brémaud, 1999) states that under the positivity condition,
a distribution $P$ satisfies the Markov property according to a graph $G$ iff it factorizes according
to the cliques of $G$,

$$P(\mathbf{y}) = \frac{1}{Z} \exp\Big[ \sum_{c \in \mathcal{C}} \Psi_c(\mathbf{y}_c) \Big], \tag{3}$$

where $\mathcal{C}$ is the set of cliques of $G$ and $\mathbf{y}_c$ is the set of random variables on clique $c$. The quantity
$Z$ is known as the partition function and serves to normalize the probability distribution. The
functions $\Psi_c$ are known as potential functions. We will assume positivity of the graphical models
under consideration, but otherwise allow for general potentials (including higher order potentials).
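As a small numerical illustration of the factorization (3), the sketch below builds a positive pmf on a three-node chain 0 - 1 - 2 from two randomly drawn pairwise potentials and checks the implied conditional independence of $Y_0$ and $Y_2$ given $Y_1$. The chain, the binary alphabet and the variable names are assumptions chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 2
psi_01 = rng.normal(size=(d, d))   # potential on clique {0, 1}
psi_12 = rng.normal(size=(d, d))   # potential on clique {1, 2}

# Unnormalized weights exp(psi_01(y0, y1) + psi_12(y1, y2)), normalized by Z
weights = np.exp(psi_01[:, :, None] + psi_12[None, :, :])
Z = weights.sum()                  # partition function
P = weights / Z                    # joint pmf, indexed as P[y0, y1, y2]

# Global Markov property on the chain: Y0 and Y2 are independent given Y1
max_gap = 0.0
for y1 in range(d):
    cond = P[:, y1, :] / P[:, y1, :].sum()                 # P(y0, y2 | y1)
    outer = np.outer(cond.sum(axis=1), cond.sum(axis=0))   # product of conditionals
    max_gap = max(max_gap, float(np.abs(cond - outer).max()))
```

Here `max_gap` measures the largest deviation from conditional independence and should be zero up to floating-point error, since node 1 separates nodes 0 and 2.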

2.2 Mixtures of Graphical Models 

In this paper, we consider mixtures of discrete graphical models. Let $H$ denote the discrete hidden
choice variable corresponding to the selection of a different component of the mixture, taking
values in $[r] := \{1, \ldots, r\}$, and let $\mathbf{Y}$ denote the observed variables of the mixture. Denote
$\pi_H := [P(H = h)]_h^\top$ as the probability vector of the mixing weights and $G_h$ as the Markov graph of the
distribution $P(\mathbf{y} \mid H = h)$.

Our goal is to learn the mixture of graphical models, given $n$ i.i.d. samples $\mathbf{y}^n = [\mathbf{y}_1, \ldots, \mathbf{y}_n]^\top$
drawn from a $p$-variate joint distribution $P(\mathbf{y})$ of the mixture model, where each variable is a
$d$-dimensional discrete variable. The component Markov graphs $\{G_h\}_h$ corresponding to models
$\{P(\mathbf{y} \mid H = h)\}_h$ are assumed to be unknown. Moreover, the variable $H$ is latent and thus, we do not
a priori know the mixture component from which a sample is drawn. This implies that we cannot
directly apply the previous methods designed for graphical model selection. A major challenge is
thus being able to decompose the observed statistics into the mixture components.
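The data model can be made concrete with a short sampling sketch: the hidden label $H$ is drawn from the mixing weights, and the observed vector is then drawn from the selected component. The representation of each component as a full joint table of shape (d, ..., d) and the function name `sample_mixture` are assumptions for illustration; they are only practical for very small p.

```python
import numpy as np

def sample_mixture(weights, components, n, rng):
    """Draw n i.i.d. samples of Y from sum_h weights[h] * components[h].

    components[h] is the joint pmf P(y | H = h) as an array of shape (d,)*p.
    The latent labels are returned only for checking; a learner never sees them.
    """
    shape = components[0].shape
    labels = rng.choice(len(weights), size=n, p=weights)
    samples = np.empty((n, len(shape)), dtype=int)
    for h, pmf in enumerate(components):
        idx = np.flatnonzero(labels == h)            # samples assigned to component h
        flat = rng.choice(pmf.size, size=idx.size, p=pmf.ravel())
        samples[idx] = np.column_stack(np.unravel_index(flat, shape))
    return samples, labels
```

Only the array `samples` corresponds to the observed data $\mathbf{y}^n$; the difficulty described above is precisely that `labels` is unavailable.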

We now propose a method for learning the mixture components given $n$ i.i.d. samples $\mathbf{y}^n$ drawn
from a graphical mixture model $P(\mathbf{y})$. Our method proceeds in three main stages. First, we
estimate the graph $G_\cup := \cup_{h=1}^r G_h$, which is the union of the Markov graphs of the mixture. This
is accomplished via a series of rank tests. Note that in the special case when $G_h = G_\cup$, this gives
the graph estimates of the component models. We then use the graph estimate $\widehat{G}_\cup$ to obtain the
pairwise marginals of the respective mixture components via a spectral decomposition method.
Finally, we use the Chow-Liu algorithm to obtain tree approximations $\{T_h\}_h$ of the individual
mixture components^3.

3 Estimation of the Union of Component Graphs 

Notation: Our learning method will be based on the estimates of probability matrices. For any
two nodes $u, v \in V$ and any set $S \subset V \setminus \{u, v\}$, denote the joint probability matrix

$$M_{u,v,\{S;k\}} := [P(Y_u = i, Y_v = j, Y_S = k)]_{i,j}, \quad k \in \mathcal{Y}^{|S|}. \tag{4}$$

Let $\widehat{M}^n_{u,v,\{S;k\}}$ denote the corresponding estimated matrices using samples $\mathbf{y}^n$,

$$\widehat{M}^n_{u,v,\{S;k\}} := [\widehat{P}^n(Y_u = i, Y_v = j, Y_S = k)]_{i,j}, \tag{5}$$

where $\widehat{P}^n$ denotes the empirical probability distribution, computed using $n$ samples. We consider
sets $S$ satisfying $|S| \le \eta$, where $\eta$ depends on the graph family under consideration. Thus, our
method is based on $(\eta + 2)^{\text{th}}$ order statistics of the observed variables.

^3 Our method can also be adapted to estimating the component Markov graphs $\{G_h\}_h$ and we outline it and other
extensions in Appendix A.1.
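The empirical matrix in (5) is a simple counting estimate. A minimal sketch follows, assuming the samples are stored as an n x p integer array with symbols coded 0, ..., d-1; the function name `empirical_matrix` and its argument layout are hypothetical.

```python
import numpy as np

def empirical_matrix(samples, u, v, S, k, d):
    """Empirical estimate of [P(Y_u = i, Y_v = j, Y_S = k)]_{i,j}.

    samples: (n, p) integer array with entries in {0, ..., d-1}
    S: tuple of conditioning node indices, k: tuple of their values
    """
    samples = np.asarray(samples)
    n = samples.shape[0]
    # rows of the sample whose S-coordinates equal the configuration k
    match = np.all(samples[:, list(S)] == np.asarray(k, dtype=int), axis=1)
    M = np.zeros((d, d))
    np.add.at(M, (samples[match, u], samples[match, v]), 1.0)  # count co-occurrences
    return M / n
```

Note that the entries sum to the empirical probability of the event $Y_S = k$ rather than to one, in agreement with the joint (not conditional) form of (4).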

Intuitions: We provide some intuitions and properties of the union graph $G_\cup = \cup_{h=1}^r G_h$, where
$G_h$ is the Markov graph corresponding to component $H = h$. Note that $G_\cup$ is different from
the Markov graph corresponding to the marginalized model $P(\mathbf{y})$ (with latent choice variable $H$
marginalized out). Yet, $G_\cup$ represents some natural Markov properties with respect to the observed
statistics. We first establish the simple result that the union graph $G_\cup$ satisfies Markov property
in each mixture component. Recall that $S(u,v;G_\cup)$ denotes a vertex separator between nodes $u$
and $v$ in $G_\cup$, i.e., its removal disconnects $u$ and $v$ in $G_\cup$.

Fact 1 (Markov Property of $G_\cup$) For any two nodes $u, v \in V$ such that $(u,v) \notin G_\cup$,

$$Y_u \perp Y_v \mid Y_S, H, \quad S := S(u,v;G_\cup). \tag{6}$$

Proof: The set $S := S(u,v;G_\cup)$ is also a vertex separator for $u$ and $v$ in each of the component
graphs $G_h$. This is because removal of $S$ disconnects $u$ and $v$ in each $G_h$. Thus, we have Markov
property in each component: $Y_u \perp Y_v \mid Y_S, \{H = h\}$, for $h \in [r]$, and the above result follows. □

Thus, the above observation implies that the conditional independence relationships of each
mixture component are satisfied on the union graph $G_\cup$ conditioned on the latent factor $H$. The
above result can be exploited to obtain a union graph estimate as follows: two nodes $u, v$ are not
neighbors in $G_\cup$ if a separator set $S$ can be found which results in conditional independence, as
in (6). The main challenge is that the variable $H$ is not observed and thus, conditional
independence cannot be directly inferred via observed statistics. However, the effect of $H$ on the
observed statistics can be quantified as follows:

Lemma 1 (Rank Property) Given an $r$-component mixture of graphical models with $G_\cup = \cup_{h=1}^r G_h$,
for any $u, v \in V$ such that $(u,v) \notin G_\cup$ and $S := S(u,v;G_\cup)$, the probability matrix
$M_{u,v,\{S;k\}} := [P(Y_u = i, Y_v = j, Y_S = k)]_{i,j}$ has rank at most $r$ for any $k \in \mathcal{Y}^{|S|}$.

Proof: From Fact 1, $G_\cup$ satisfies Markov property conditioned on the latent factor $H$,

$$Y_u \perp Y_v \mid Y_S, H, \quad \forall (u,v) \notin G_\cup. \tag{7}$$

This implies that

$$M_{u,v,\{S;k\}} = M_{u|H,\{S;k\}} \, \mathrm{Diag}(\pi_{H|\{S;k\}}) \, M_{v|H,\{S;k\}}^\top \, P(Y_S = k), \tag{8}$$

where $M_{u|H,\{S;k\}} := [P(Y_u = i \mid H = j, Y_S = k)]_{i,j}$ and similarly $M_{v|H,\{S;k\}}$ is defined. $\mathrm{Diag}(\pi_{H|\{S;k\}})$
is the diagonal matrix with entries $\pi_{H|\{S;k\}} := [P(H = i \mid Y_S = k)]_i$. Since the inner dimension of the
product in (8) is $r$, we have that $\mathrm{Rank}(M_{u,v,\{S;k\}})$ is at most $r$. □
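The rank bound of Lemma 1 can be checked numerically in the simplest case of an empty separator, where the joint matrix of two non-neighboring nodes factorizes through the hidden variable as in (8). The sketch below uses randomly drawn conditional distributions; the sizes d = 4 and r = 2 and the mixing weights are arbitrary assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4, 2                                   # state-space size and number of components

def random_conditional(d, r, rng):
    """d x r matrix whose column h is a pmf P(Y = . | H = h)."""
    A = rng.random((d, r))
    return A / A.sum(axis=0, keepdims=True)

M_u = random_conditional(d, r, rng)           # P(Y_u = i | H = h)
M_v = random_conditional(d, r, rng)           # P(Y_v = j | H = h)
pi = np.array([0.3, 0.7])                     # mixing weights P(H = h)

# Non-neighbors with S empty: the joint matrix factorizes through H as in (8)
M_uv = M_u @ np.diag(pi) @ M_v.T
rank = np.linalg.matrix_rank(M_uv)
```

Although `M_uv` is a 4 x 4 matrix, its rank cannot exceed the number of components, which is what the rank test exploits.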

Thus, the effect of marginalizing the choice variable $H$ is seen in the rank of the observed
probability matrices $M_{u,v,\{S;k\}}$. Thus, when $u$ and $v$ are non-neighbors in $G_\cup$, a separator set $S$
can be found such that the rank of $M_{u,v,\{S;k\}}$ is at most $r$. In order to use this result as a criterion
for inferring neighbors in $G_\cup$, we require that the rank of $M_{u,v,\{S;k\}}$ for any neighbors $(u,v) \in G_\cup$
be strictly larger than $r$. This requires the dimension of each node variable $d > r$. We discuss in
detail the set of sufficient conditions for correctly recovering $G_\cup$ in Section 3.1.1.

Tractable Graph Families: Another obstacle in using Lemma 1 to estimate graph $G_\cup$ is
computational: the search for separators $S$ for any node pair $u, v \in V$ is exponential in $|V| := p$
if no further constraints are imposed. We consider graph families where a vertex separator can be
found for any $(u,v) \notin G_\cup$ with size at most $\eta$:

$$|S(u,v;G_\cup)| \le \eta, \quad \forall (u,v) \notin G_\cup.$$

There are many natural families where $\eta$ is small:

1. If $G_\cup$ is trivial (i.e., no edges) then $\eta = 0$, and we have a mixture of product distributions.

2. When $G_\cup$ is a tree, i.e., we have a mixture model Markov on the same tree, then $\eta = 1$, since
there is a unique path between any two nodes on a tree.

3. For an arbitrary $r$-component tree mixture, $G_\cup = \cup_h T_h$ where each component is a tree
distribution, we have that $\eta \le r$ (since for any node pair, there is a unique path in each of
the $r$ trees $\{T_h\}$, and separating the node pair in each $T_h$ also separates them on $G_\cup$).

4. For an arbitrary mixture of bounded degree graphs, we have $\eta \le \sum_{h \in [r]} \Delta_h$, where $\Delta_h$ is the
maximum degree in $G_h$, i.e., the Markov graph corresponding to component $\{H = h\}$.

In general, $\eta$ depends on the respective bounds $\eta_h$ for the component graphs $G_h$, as well as the
extent of their overlap. In the worst case, $\eta$ can be as high as $\sum_{h \in [r]} \eta_h$, while in the special case
when $G_h = G_\cup$, the bound remains the same, $\eta_h = \eta$. Note that for a general graph $G_\cup$ with
treewidth $\mathrm{tw}(G_\cup)$ and maximum degree $\Delta(G_\cup)$, we have that $\eta \le \min(\Delta(G_\cup), \mathrm{tw}(G_\cup))$. Thus, a
wide family of models give rise to a union graph with small $\eta$, including tree mixtures and mixtures
over bounded degree graphs.

We establish in the sequel that the computational and sample complexities of our learning
method scale exponentially in $\eta$. Thus, our algorithm is suitable for graphs $G_\cup$ with small $\eta$. In
Section B we relax the requirement of exact separation to that of local separation. A larger class
of graphs satisfy the local separation property including mixtures of locally tree-like graphs.
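For small graphs, the separator bound of a given union graph can be found by brute force, which makes the examples above easy to verify. The sketch below is an illustration only: the function names are hypothetical, nodes are assumed to be labelled 0, ..., p-1, and the exhaustive search is exponential in p.

```python
import itertools

def is_separated(adj, u, v, S):
    """True if removing the nodes in S disconnects u from v."""
    seen, stack = {u}, [u]
    while stack:
        a = stack.pop()
        for b in adj[a]:
            if b == v:
                return False
            if b not in seen and b not in S:
                seen.add(b)
                stack.append(b)
    return True

def separator_bound(p, edges):
    """Smallest eta such that every non-adjacent pair has a separator of size <= eta."""
    adj = {a: set() for a in range(p)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    eta = 0
    for u, v in itertools.combinations(range(p), 2):
        if v in adj[u]:
            continue                                  # neighbors need no separator
        others = [w for w in range(p) if w not in (u, v)]
        # smallest separator size for this pair; all other nodes always separate
        size = next(s for s in range(p - 1)
                    for S in itertools.combinations(others, s)
                    if is_separated(adj, u, v, set(S)))
        eta = max(eta, size)
    return eta
```

For instance, a single path on four nodes gives a bound of 1, while the union of the two trees with edge sets {01, 12, 23} and {02, 12, 13} gives a bound of 2, in line with the bound for an r-component tree mixture with r = 2.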

Rank Test: We propose RankTest($\mathbf{y}^n; \xi_{n,p}, \eta, r$) in Algorithm 1 for structure estimation of
$G_\cup := \cup_{h=1}^r G_h$, the union Markov graph of an $r$-component mixture. The method is based on a
search for potential separators $S$ between any two given nodes $u, v \in V$, based on the effective
rank^4 of $\widehat{M}^n_{u,v,\{S;k\}}$. If the effective rank is $r$ or less, then $u$ and $v$ are declared as non-neighbors
(and set $S$ as their separator). If no such sets are found, they are declared as neighbors. Thus,
the method involves searching for separators for each node pair $u, v \in V$, by considering all sets
$S \subset V \setminus \{u,v\}$ satisfying $|S| \le \eta$. The computational complexity of this procedure is $O(p^{\eta+2} d^3)$,
where $d$ is the dimension of each node variable $Y_i$, for $i \in V$, and $p$ is the number of nodes. This is
because the number of rank tests performed is $O(p^{\eta+2})$ over all node pairs and conditioning sets;
each rank test has $O(d^3)$ complexity since it involves performing singular value decomposition
(SVD) of a $d \times d$ matrix.



^4 The effective rank is given by the number of singular values above a given threshold $\xi$.
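The effective rank used in the test is straightforward to compute from a singular value decomposition. A minimal sketch, with the hypothetical function name `effective_rank`:

```python
import numpy as np

def effective_rank(A, xi):
    """Number of singular values of A strictly greater than the threshold xi."""
    singular_values = np.linalg.svd(np.asarray(A, dtype=float), compute_uv=False)
    return int(np.sum(singular_values > xi))
```

Thresholding is what makes the test usable with empirical matrices, whose small singular values are nonzero only because of sampling noise.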



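Algorithm 1 $\widehat{G}^n_\cup$ = RankTest($\mathbf{y}^n; \xi_{n,p}, \eta, r$) for estimating $G_\cup := \cup_{h=1}^r G_h$ of an $r$-component
mixture using $\mathbf{y}^n$ samples, where $\eta$ is the bound on size of vertex separators between any node pair
in $G_\cup$ and $\xi_{n,p}$ is a threshold on the singular values.

Rank($A; \xi$) denotes the effective rank of matrix $A$, i.e., number of singular values more than $\xi$.
$\widehat{M}^n_{u,v,\{S;k\}} := [\widehat{P}^n(Y_u = i, Y_v = j, Y_S = k)]_{i,j}$ is the empirical estimate computed using $n$ i.i.d.
samples $\mathbf{y}^n$. Initialize $\widehat{G}^n_\cup = (V, \emptyset)$.

For each $u, v \in V$, estimate $\widehat{M}^n_{u,v,\{S;k\}}$ from $\mathbf{y}^n$; if

$$\min_{\substack{S \subset V \setminus \{u,v\} \\ |S| \le \eta}} \ \max_{k \in \mathcal{Y}^{|S|}} \mathrm{Rank}\big(\widehat{M}^n_{u,v,\{S;k\}}; \xi_{n,p}\big) > r, \tag{9}$$

then add $(u,v)$ to $\widehat{G}^n_\cup$.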
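Algorithm 1 can be sketched directly as nested loops over node pairs, candidate separators and separator configurations. The code below is an unoptimized illustration under assumptions of our own: samples are an n x p integer array with symbols coded 0, ..., d-1, the threshold `xi` is supplied by the caller, and the name `rank_test` is hypothetical.

```python
import itertools
import numpy as np

def rank_test(samples, d, eta, r, xi):
    """Sketch of Algorithm 1: returns the estimated edge set of the union graph."""
    samples = np.asarray(samples)
    n, p = samples.shape
    edges = set()
    for u, v in itertools.combinations(range(p), 2):
        others = [w for w in range(p) if w not in (u, v)]
        separated = False
        # search all candidate separators S with |S| <= eta
        for size in range(eta + 1):
            for S in itertools.combinations(others, size):
                worst = 0                      # max over configurations k of Y_S
                for k in itertools.product(range(d), repeat=size):
                    match = np.all(samples[:, list(S)] == np.array(k, dtype=int), axis=1)
                    M = np.zeros((d, d))       # empirical matrix as in (5)
                    np.add.at(M, (samples[match, u], samples[match, v]), 1.0 / n)
                    sv = np.linalg.svd(M, compute_uv=False)
                    worst = max(worst, int(np.sum(sv > xi)))
                if worst <= r:                 # S behaves like a separator
                    separated = True
                    break
            if separated:
                break
        if not separated:                      # criterion (9): min over S, max over k > r
            edges.add((u, v))
    return edges
```

With r = 1 the same routine acts as a plain graphical model selection test; for example, on samples from a binary Markov chain 0 - 1 - 2 it is expected to return the two chain edges once n is large relative to the chosen threshold.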

3.1 Results for the Rank Test 

3.1.1 Conditions for the Success of Rank Tests 

The following assumptions are made for the RankTest proposed in Algorithm 1 to succeed under
the PAC formulation.

(A1) Number of Mixture Components: The number of components $r$ of the mixture model
and dimension $d$ of each node variable satisfy

$$d > r. \tag{10}$$

The mixing weights of the latent factor $H$ are assumed to be strictly positive

$$\pi_H(h) := P(H = h) > 0, \quad \forall h \in [r].$$

(A2) Constraints on Graph Structure: Recall that $G_\cup = \cup_{h=1}^r G_h$ denotes the union of the
graphs of the components and that $\eta$ denotes the bound on the size of the minimal separator
set for any two (non-neighboring) nodes in $G_\cup$. We assume that

$$|S(u,v;G_\cup)| \le \eta = O(1), \quad \forall (u,v) \notin G_\cup.$$

In Section B we relax the strict separation constraint to a local separation constraint in the
regime of correlation decay, where $\eta$ refers to the bound on the size of local separators between
any two non-neighbor nodes in the graph.

(A3) Rank Condition: We assume that the matrix $M_{u,v,\{S;k\}}$ in (4) has rank strictly greater
than $r$ when the nodes $u$ and $v$ are neighbors in graph $G_\cup = \cup_{h=1}^r G_h$ and the set $S$ satisfies
$|S| \le \eta$. Let $\rho_{\min}$ denote

$$\rho_{\min} := \min_{\substack{(u,v) \in G_\cup,\ |S| \le \eta \\ S \subset V \setminus \{u,v\}}} \ \max_{k \in \mathcal{Y}^{|S|}} \sigma_{r+1}\big(M_{u,v,\{S;k\}}\big) > 0, \tag{11}$$

where $\sigma_{r+1}(\cdot)$ denotes the $(r+1)^{\text{th}}$ singular value, when the singular values are arranged in
the descending order $\sigma_1(\cdot) \ge \sigma_2(\cdot) \ge \ldots \ge \sigma_d(\cdot)$.



(A4) Choice of threshold $\xi$: For RankTest in Algorithm 1, the threshold $\xi$ is chosen as

$$\xi := \frac{\rho_{\min}}{2}.$$

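(A5) Number of Samples: Given $\delta \in (0, 1)$, the number of samples $n$ satisfies

$$n > n_{\mathrm{Rank}}(\delta; p) := \max\left( \frac{1}{t^2}\left(2 \log p + \log \delta^{-1} + \log 2\right), \ \left(\frac{2}{\rho_{\min} - t}\right)^2 \right), \tag{12}$$

for some $t \in (0, \rho_{\min})$ (e.g. $t = \rho_{\min}/2$), where $p$ is the number of nodes and $\rho_{\min}$ is given by (11).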

Assumption (A1) relates the number of components to the dimension of the sample space of the
variables. Note that we allow for the number of components $r$ to grow with the number of nodes $p$,
as long as the cardinality of the sample space of each variable $d$ is also large enough. In principle,
this assumption can be removed by grouping the nodes together and performing rank
tests on the groups. Assumption (A2) imposes constraints on the graph structure $G_\cup$, formed by the
union of the component graphs. The bound $\eta$ on the separator sets for node pairs in $G_\cup$ is a crucial
parameter and the complexity of learning (both sample and computational) depends on it. We relax
the assumption of separator bound to a criterion of local separation in Section B. Assumption (A3)
is required for the success of rank tests to distinguish neighbors and non-neighbors in graph $G_\cup$. It
rules out the presence of spurious low rank matrices between neighboring nodes in $G_\cup$ (for instance,
when the nodes are marginally independent or when the distribution is degenerate). Assumption
(A4) provides a natural threshold on the singular values in the rank test. In Section B we modify
the threshold to also account for distortion due to approximate vertex separation, in contrast to the
setting of exact separation considered in this section. Assumption (A5) provides the finite sample
complexity bound.

3.1.2 Result on Rank Tests 

We now provide the result on the success of recovering the graph $G_\cup := \cup_{h=1}^r G_h$.

Theorem 1 (Success of Rank Tests) The RankTest($\mathbf{y}^n; \xi, \eta, r$) outputs the correct graph
$G_\cup := \cup_{h=1}^r G_h$, which is the union of the component Markov graphs, under the assumptions (A1)-(A5)
with probability at least $1 - \delta$.

Proof: The proof is given in Appendix O □ 
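The rank test itself is specified in Algorithm 1, outside this section. As a rough illustration only, the sketch below assumes the test has the following form: for a node pair $(u,v)$, search over candidate sets $S$ with $|S| \le \eta$, and declare a non-edge as soon as some $S$ makes the $(r+1)$-th singular value of the empirical matrix $M_{u,v,\{S;k\}}$ fall below $\xi$ for every configuration $k$. The function names `empirical_joint` and `rank_test_edge`, the sample layout (one row per sample, values in $\{0,\ldots,d-1\}$) and the decision rule are assumptions made for this example.

```python
import itertools
import numpy as np


def empirical_joint(samples, u, v, S, k, d):
    """M[i, j] = empirical P(Y_u = i, Y_v = j, Y_S = k) for d-ary variables."""
    M = np.zeros((d, d))
    for y in samples:
        if all(y[s] == ks for s, ks in zip(S, k)):
            M[y[u], y[v]] += 1.0
    return M / len(samples)


def rank_test_edge(samples, u, v, d, r, eta, xi):
    """Assumed decision rule: (u, v) is an edge unless some S with |S| <= eta
    makes sigma_{r+1}(M_{u,v,{S;k}}) smaller than xi for every configuration k."""
    p = samples.shape[1]
    others = [w for w in range(p) if w not in (u, v)]
    for size in range(eta + 1):
        for S in itertools.combinations(others, size):
            worst = 0.0
            for k in itertools.product(range(d), repeat=size):
                M = empirical_joint(samples, u, v, S, k, d)
                sigma = np.linalg.svd(M, compute_uv=False)  # sorted descending
                worst = max(worst, sigma[r])  # sigma_{r+1}, 0-based index r
            if worst < xi:
                return False  # S acts as a separator: rank <= r for all k
    return True
```

The sketch requires $d > r$, as in (A1), so that the $(r+1)$-th singular value exists. Its cost grows with the number of candidate sets of size at most $\eta$, which is why the separator bound drives the complexity.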

A special case of the above result is graphical model selection, where there is a single graphical 
model (r = 1) and we are interested in estimating its graph structure. 

Corollary 1 (Application to Graphical Model Selection) The RankTest($y^n; \xi, \eta, 1$) outputs the correct Markov graph $G$, given $n$ i.i.d. samples $y^n$, under the assumptions (A2)-(A5) with probability at least $1 - \delta$.

Footnote: When $r = 1$, there is no latent factor, and the assumption $d > r$ in (A1) is trivially satisfied for all discrete random variables.
Remarks: Thus, the rank test is also applicable for graphical model selection. Previous works (see Section 1.2) have proposed tests based on conditional independence, using either conditional mutual information or conditional variation distances; see Anandkumar et al. (2012g); Bresler et al. (2008). The rank test above is thus an alternative test for conditional independence. In addition, it extends naturally to estimation of the union graph structure of mixture components.



4 Parameter Estimation of Mixture Components 

The rank test proposed in the previous section is a tractable procedure for estimating the graph $G_\cup := \cup_{h=1}^r G_h$, which is the union of the component graphs of a mixture of graphical models. However, except in the special case when $G_h = G_\cup$, the knowledge of $G_\cup$ is not very useful by itself, since we do not know the nature of the different components of the mixture. In this section, we propose the use of spectral decomposition tests to find the various mixture components.



4.0.3 Spectral Decomposition for Mixture of Product Distributions 

The spectral decomposition methods, first proposed by Chang (1996), later generalized by Mossel and Roch (2006) and Hsu et al. (2009), and recently by Anandkumar et al. (2012a), are applicable for mixtures of product distributions. We illustrate the method below via a simple example.

Consider the simple case of three observed variables $\{Y_u, Y_v, Y_w\}$, where a latent factor $H$ separates them, i.e., the observed variables are conditionally independent given $H$:

$Y_u \perp Y_v \perp Y_w \mid H.$

[Figure: the latent factor $H$ and the observed nodes.]

This implies that the Markov graphs $\{G_h\}_{h \in [r]}$ of the component models $\{P(Y_u, Y_v, Y_w | H = h)\}_{h \in [r]}$ are trivial (i.e., have no edges), and thus this forms a special case of our setting.

We now give an overview of the spectral decomposition method. It proceeds by considering pairwise and triplet statistics of $Y_u, Y_v, Y_w$. Denote $M_{u|H} := [P(Y_u = i | H = j)]_{i,j}$, and similarly for $M_{v|H}, M_{w|H}$, and assume that they are full rank. Denote the probability matrices $M_{u,v} := [P(Y_u = i, Y_v = j)]_{i,j}$ and $M_{u,v,\{w;k\}} := [P(Y_u = i, Y_v = j, Y_w = k)]_{i,j}$. The parameters (i.e., matrices $M_{u|H}, M_{v|H}, M_{w|H}$) can be estimated as follows.



Lemma 2 (Mixture of Product Distributions) For the latent variable model $Y_u \perp Y_v \perp Y_w \mid H$, when the conditional probability matrices $M_{u|H}, M_{v|H}, M_{w|H}$ have rank $d$, let $\lambda^{(k)} = [\lambda^{(k)}_1, \ldots, \lambda^{(k)}_d]^\top$ be the column vector with the $d$ eigenvalues given by

$\lambda^{(k)} := \mathrm{Eigenvalues}\left(M_{u,v,\{w;k\}}\, M_{u,v}^{-1}\right), \quad k \in \mathcal{Y}.$ (13)

Let $\Lambda := [\lambda^{(1)} | \lambda^{(2)} | \ldots | \lambda^{(d)}]$ be the matrix where the $k$-th column corresponds to $\lambda^{(k)}$ from above. We have that

$M_{w|H} := [P(Y_w = i | H = j)]_{i,j} = \Lambda^\top.$ (14)

Proof: A more general result is proven in Appendix D.1. □
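The identity behind Lemma 2 can be checked directly from the conditional independence of $Y_u, Y_v, Y_w$ given $H$. The display below is a short sketch, not the proof from the appendix. It writes $\pi_H$ for the vector of mixing weights $P(H = j)$, and it assumes that $M_{u|H}$ and $M_{v|H}$ are square and invertible and that all mixing weights are positive.

```latex
\begin{align*}
M_{u,v} &= M_{u|H}\,\mathrm{Diag}(\pi_H)\,M_{v|H}^\top,\\
M_{u,v,\{w;k\}} &= M_{u|H}\,\mathrm{Diag}(\pi_H)\,
  \mathrm{Diag}\big(P(Y_w = k \mid H = 1),\ldots,P(Y_w = k \mid H = d)\big)\,M_{v|H}^\top,\\
M_{u,v,\{w;k\}}\,M_{u,v}^{-1} &= M_{u|H}\,
  \mathrm{Diag}\big(P(Y_w = k \mid H = 1),\ldots,P(Y_w = k \mid H = d)\big)\,M_{u|H}^{-1}.
\end{align*}
```

The last line is an eigendecomposition: the eigenvalues are the entries $P(Y_w = k | H = j)$, and the eigenvectors are the columns of $M_{u|H}$ up to scaling, which depend neither on $k$ nor on the choice of $v$ and $w$.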

Thus, we have a procedure for recovering the conditional probabilities of the observed variables conditioned on the latent factor. Using these parameters, we can also recover the mixing weights $\pi_H := [P(H = i)]_i^\top$ using the relationship

$M_{u,v} = M_{u|H}\,\mathrm{Diag}(\pi_H)\,M_{v|H}^\top,$

where $\mathrm{Diag}(\pi_H)$ is the diagonal matrix with $\pi_H$ as the diagonal elements.

Thus, if we have a general product distribution mixture over nodes in $V$, we can learn the parameters by performing the above spectral decomposition over different triplets $\{u,v,w\}$. However, an obstacle remains: spectral decomposition over different triplets $\{u,v,w\}$ results in different permutations of the labels of the hidden variable $H$. To overcome this, note that any two triplets $(u,v,w)$ and $(u,v',w')$ share the same set of eigenvectors in (13) when the "left" node $u$ is the same. Thus, if we consider a fixed node $u_* \in V$ as the "left" node and use a fixed matrix to diagonalize (13) for all triplets, we obtain a consistent ordering of the hidden labels over all triplet decompositions. Thus, we can learn a general product distribution mixture using only third-order statistics.
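The following NumPy sketch illustrates the decomposition in the simplest setting, under assumptions that go beyond the text: exact (population) statistics are supplied, all variables have the same cardinality $d = r$, and a random linear combination of the matrices in (13) is used to obtain well-separated eigenvalues. The names `product_mixture_stats` and `spectral_decompose` are hypothetical. The mixing weights are recovered from the row sums of $M_{u,v}$, which is the stated relationship multiplied by the all-ones vector. Passing the returned matrix `R` to a second call shows how a fixed diagonalizing matrix aligns the hidden labels across triplets that share the left node $u$.

```python
import numpy as np


def product_mixture_stats(M_u, M_v, M_w, pi):
    """Exact pairwise and triplet statistics of a product mixture (helper for demos)."""
    M_uv = M_u @ np.diag(pi) @ M_v.T
    M_uvw = [M_u @ np.diag(pi * M_w[k]) @ M_v.T for k in range(M_w.shape[0])]
    return M_uv, M_uvw


def spectral_decompose(M_uv, M_uvw, R=None, seed=0):
    """Recover (M_{w|H}, M_{u|H}, pi_H, R) up to a common column permutation.

    M_uv[i, j]     = P(Y_u = i, Y_v = j)
    M_uvw[k][i, j] = P(Y_u = i, Y_v = j, Y_w = k)
    R, if given, is the diagonalizing matrix from an earlier triplet with the same u.
    """
    d = M_uv.shape[0]
    M_uv_inv = np.linalg.inv(M_uv)
    C = [M_uvw[k] @ M_uv_inv for k in range(d)]  # C_k as in (13)
    if R is None:
        # All C_k share eigenvectors; a random combination separates eigenvalues.
        theta = np.random.default_rng(seed).standard_normal(d)
        _, R = np.linalg.eig(sum(t * Ck for t, Ck in zip(theta, C)))
        R = np.real(R)
    R_inv = np.linalg.inv(R)
    # Column k of Lam holds the eigenvalues lambda^(k), in the order fixed by R.
    Lam = np.column_stack([np.diag(R_inv @ Ck @ R) for Ck in C])
    M_w_given_H = Lam.T                                # as in (14)
    M_u_given_H = R / R.sum(axis=0, keepdims=True)     # eigenvectors, columns sum to 1
    pi_H = np.linalg.solve(M_u_given_H, M_uv.sum(axis=1))
    return M_w_given_H, M_u_given_H, pi_H, R
```

With estimated rather than exact statistics, the eigenvalues may acquire small imaginary parts and the recovered columns are only approximate; the sketch does not handle that case.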

4.0.4 Spectral Decomposition for Learning Graphical Model Mixtures 

We now adapt the above method for learning more general graphical model mixtures. We first make a simple observation on how to obtain mixtures of product distributions by considering separators on the union graph $G_\cup$. For any three nodes $u,v,w \in V$ which are not neighbors on $G_\cup$, let $S_{uvw}$ denote a multiway vertex separator, i.e., the removal of nodes in $S_{uvw}$ disconnects $u$, $v$ and $w$ in $G_\cup$. On lines of Fact 1,

$Y_u \perp Y_v \perp Y_w \mid Y_{S_{uvw}}, H, \quad \forall u,v,w : (u,v),(v,w),(w,u) \notin G_\cup.$ (15)

Thus, by fixing the configuration of nodes in $S_{uvw}$, we obtain a product distribution mixture over $\{u,v,w\}$. If the previously proposed rank test is successful in estimating $G_\cup$, then we possess correct knowledge of the separators $S_{uvw}$. In this case, we can obtain estimates $\{\widehat{P}(Y_w | Y_{S_{uvw}} = k, H = h)\}_h$ by fixing the nodes in $S_{uvw}$ to $k$ and using the spectral decomposition described in Lemma 2, and the procedure can be repeated over different triplets $\{u,v,w\}$. See Figure 1.



Figure 1: By conditioning on the separator set $S$ on the union graph $G_\cup$, we have a mixture of product distributions with respect to nodes $\{u,v,w\}$, i.e., $Y_u \perp Y_v \perp Y_w \mid Y_S, H$.
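To make the reduction concrete, the display below spells out how (15) turns the joint statistics at a fixed separator configuration $Y_S = k$ into a product mixture. The symbol $\pi_{H|\{S;k\}}$ for the conditional mixing weights is introduced here for illustration; the display restates the conditioning step and is not an additional result.

```latex
\begin{align*}
P(Y_u = i, Y_v = j, Y_w = l, Y_S = k)
 &= P(Y_S = k) \sum_{h \in [r]} \pi_{H|\{S;k\}}(h)\,
    P(Y_u = i \mid H = h, Y_S = k)\,
    P(Y_v = j \mid H = h, Y_S = k)\,
    P(Y_w = l \mid H = h, Y_S = k),\\
\pi_{H|\{S;k\}}(h) &:= P(H = h \mid Y_S = k).
\end{align*}
```

For each fixed $k$, this has the same form as the product mixture treated in Lemma 2, with the conditional probability matrices and mixing weights replaced by their versions conditioned on $Y_S = k$.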

An obstacle remains, viz., the permutation of hidden labels over different triplet decompositions $\{u,v,w\}$. In the case of a product distribution mixture, as discussed previously, this is resolved by fixing the "left" node in the triplet to some $u_* \in V$ and using the same matrix for diagonalization over different triplets. However, an additional complication arises when we consider graphical model mixtures, where conditioning over separators is required. We require that the permutation of the hidden labels be unchanged upon conditioning over different values of variables in the separator set $S_{u_* v w}$. This holds when the separator set $S_{u_* v w}$ has no effect on node $u_*$, i.e., we require that

$\exists\, u_* \in V \ \text{s.t.}\ Y_{u_*} \perp Y_{V \setminus u_*} \mid H,$ (16)

which implies that $u_*$ is isolated from all other nodes in graph $G_\cup$.

Condition (16) is required to hold for identifiability if we only operate on statistics over different triplets (along with their separator sets). In other words, if we resort to operations over only low order statistics, we require additional conditions such as (16) for identifiability. However, our setting is a significant generalization over the mixtures of product distributions, where (16) is required to hold for all nodes.

Finally, since our goal is to estimate pairwise marginals of the mixture components, in place of node $w$ in the triplet $\{u,v,w\}$ in Lemma 2, we need to consider a node pair $a,b \in V$. The general algorithm allows the variables in the triplet to have different dimensions; see Anandkumar et al. (2012a) for details. Thus, we obtain estimates of the pairwise marginals of the mixture components.
The computational complexity of the procedure scales as 0{p^dP~^^r), where p is the number of 
nodes, $d$ is the cardinality of each node variable and $\eta$ is the bound on separator sets. For details
on implementation of the spectral method, see Appendix A.

4.1 Results for Spectral Decomposition 

4.1.1 Assumptions 

In addition to the assumptions (A1)-(A5) in Section 3.1.1, we impose the following constraints to guarantee the success of estimating the various mixture components.

(A6) Full Rank Views of the Latent Factor: For each node pair $a,b \in V$, and any subset $S \subset V \setminus \{a,b\}$ with $|S| \le 2\eta$ and $k \in \mathcal{Y}^{|S|}$, the probability matrix $M_{(a,b)|H,\{S;k\}} := [P(Y_{a,b} = i | H = j, Y_S = k)]_{i,j} \in \mathbb{R}^{d^2 \times r}$ has rank $r$.

(A7) Existence of an Isolated Node: There exists a node $u_* \in V$ which is isolated from all other nodes in $G_\cup = \cup_{h=1}^r G_h$, i.e.,

$Y_{u_*} \perp Y_{V \setminus u_*} \mid H.$ (17)

(A8) Spectral Bounds and Random Rotation Matrix: Refer to the various spectral bounds used to obtain $K(\delta;p,d,r)$ in Appendix D.3, where $\delta \in (0,1)$ is fixed. Further assume that the rotation matrix $Z \in \mathbb{R}^{r \times r}$ in FindMixtureComponents is chosen uniformly over the Stiefel manifold $\{Q \in \mathbb{R}^{r \times r} : Q^\top Q = I\}$.

(A9) Number of Samples: For fixed $\delta, \epsilon \in (0,1)$, the number of samples satisfies

n>nspect{(>,e;p,d,r) := ^ ' (18) 

where $K(\delta;p,d,r)$ is defined in (58).

Assumption (A6) is a natural condition required for the success of spectral decomposition, and is imposed in (Mossel and Roch, 2006), (Hsu et al., 2009) and (Anandkumar et al., 2012a). It is also known that learning singular models, i.e., those which do not satisfy (A6), is at least as hard as learning parity with noise, which is conjectured to be computationally hard (Mossel and Roch, 2006). The condition in (A7) is indeed an additional constraint on graph $G_\cup$, but is required to ensure alignment of hidden labels over spectral decompositions of different groups of variables, as discussed before. Condition (A8) assumes various spectral bounds and (A9) characterizes the sample complexity.
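One standard way to draw the random rotation matrix required in (A8) is to orthonormalize a Gaussian matrix. The sketch below uses a QR factorization with a sign correction, which yields a matrix distributed uniformly (Haar) over the $r \times r$ orthogonal matrices. The function name `random_rotation` is hypothetical, and the paper does not state which sampling scheme it uses.

```python
import numpy as np


def random_rotation(r, seed=None):
    """Draw Z with Z^T Z = I, uniformly over r x r orthogonal matrices."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((r, r))
    Q, T = np.linalg.qr(G)
    # Multiply each column by the sign of the matching diagonal entry of T;
    # without this correction the distribution of Q is not uniform.
    return Q * np.sign(np.diag(T))
```

The rows of the returned matrix can then serve as the orthonormal basis $\{z_1, \ldots, z_r\}$ used by the spectral procedure.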

4.1.2 Guarantees for Learning Mixture Components 

We now provide the result on the success of recovering the tree approximation $T_h$ of each mixture component $P(y|H = h)$. Let $\|\cdot\|_2$ on a vector denote the $\ell_2$ norm.

Theorem 2 (Guarantees for FindMixtureComponents) Under the assumptions (A1)-(A9), the procedure in Algorithm 2 outputs $\widehat{P}^{\mathrm{spect}}(Y_a, Y_b | H = h)$, for each $a,b \in V$, such that for all $h \in [r]$, there exists a permutation $\tau(h) \in [r]$ with

$\|\widehat{P}^{\mathrm{spect}}(Y_a, Y_b | H = h) - P(Y_a, Y_b | H = \tau(h))\|_2 \le \epsilon,$ (19)

with probability at least $1 - 4\delta$.

Proof: The proof is given in Appendix D. □

Remarks: Recall that $p$ denotes the number of variables, $r$ denotes the number of mixture
components, $d$ denotes the dimension of each node variable and $\eta$ denotes the bound on separator
sets between any node pair in the union graph. The quantity $K(\delta;p,d,r)$ in (58) in Appendix D.3
is O (p^^"'"^(i^^r^5~^ polylog(p, d, r, 5~^)). Thus, we require the number of samples scaling in (jlSp 
as n = O (^p^^'^^d^'^r^^d^^e^'^ poly log{p, d,r, 6~^)) . Since we operate in the regime where rj = 0(1) 
is a small constant, this implies that we have a polynomial sample complexity in $p, d, r$. Note that
the special case of $\eta = 0$ corresponds to the case of mixture of product distributions, and it has the
best sample complexity.

4.1.3 Analysis of Tree Approximation 

We now consider the final stage of our approach, viz., learning tree approximations using the estimates of the pairwise marginals of the mixture components from the spectral decomposition method. We now impose a standard condition of non-degeneracy on each mixture component to guarantee the existence of a unique tree structure corresponding to the maximum-likelihood tree approximation to the mixture component.

(A10) Separation of Mutual Information: Let $T_h$ denote the Chow-Liu tree corresponding to the model $P(y|H = h)$ when exact statistics are input, and let

$\vartheta := \min_{h \in [r]} \ \min_{(a,b) \notin T_h} \ \min_{(u,v) \in \mathrm{Path}(a,b;T_h)} \left( I(Y_u, Y_v | H = h) - I(Y_a, Y_b | H = h) \right),$ (20)

where $\mathrm{Path}(a,b;T_h)$ denotes the edges along the path connecting $a$ and $b$ in $T_h$.



Footnote: (A7) can be relaxed as follows: if graph $G_\cup$ has at least three connected components, then we can choose a reference node in each of the components and estimate the marginals in the other components. For instance, if $C_1, C_2, C_3$ are three connected components in $G_\cup$, then we can choose a node in $C_1$ as the reference node to estimate the marginals of $C_2$ and $C_3$. Similarly, we can choose a node in $C_2$ as a reference node and estimate the marginals in $C_1$ and $C_3$. We can then align these different estimates and obtain all the marginals.
(A11) Number of Samples: For $\epsilon^{\mathrm{tree}}$ defined in (69), the number of samples is now required to satisfy

$n > n_{\mathrm{spect}}(\delta, \epsilon^{\mathrm{tree}}; p, d, r),$ (21)

where $n_{\mathrm{spect}}$ is given by (18).

The condition in (A10) assumes a separation between mutual information along edges and non-edges of the Chow-Liu tree $T_h$ of each component model $P(y|H = h)$. The quantity $\vartheta$ represents the minimum separation between the mutual information along an edge and any set of non-edges which can replace the edge in $T_h$. Note that $\vartheta \ge 0$ due to the max-weight spanning tree property of $T_h$ (under exact statistics). Intuitively, $\vartheta$ denotes the "bottleneck" where errors are most likely to occur in tree structure estimation. Similar observations were made by Tan et al. (2011) for error exponent analysis of the Chow-Liu algorithm. The sample complexity for correctly estimating $T_h$ using samples is based on $\vartheta$ and given in (A11). This ensures that the mutual information quantities are estimated within the separation bound $\vartheta$.

Theorem 3 (Tree Approximations of Mixture Components) Under (A1)-(A11), the Chow-Liu algorithm outputs the correct tree structures corresponding to maximum-likelihood tree approximations of the mixture components $\{P(y|H = h)\}$ with probability at least $1 - 4\delta$, when the estimates of pairwise marginals $\{\widehat{P}^{\mathrm{spect}}(Y_a, Y_b | H = h)\}$ from the spectral decomposition method are input.

Proof: See Section D.5. □

Remarks: Thus our approach succeeds in recovering the correct tree structures corresponding 
to ML-tree approximations of mixture components with computational and sample complexities 
scaling polynomially in the number of variables p, number of components r and the dimension of 
each variable d. 

Note that if the underlying model is a tree mixture, we recover the tree structures of the mix- 
ture components. For this special case, we can give a slightly better guarantee by estimating 
Chow-Liu trees which are subgraphs of the union graph estimate Gy , and this is discussed in Ap- 
pendix [D.4[ The improved bound K^'^'^^{6;p,d,r) is O [p'^{dA)'^^r^6~^ polj\og{p,d,r,6~^)^, where 
A is the maximum degree in Gy. 

5 Conclusion 

In this paper, we considered learning tree approximations of graphical model mixtures. We proposed 
novel methods which combined techniques used previously in graphical model selection, and in 
learning mixtures of product distributions. We provided provable guarantees for our method, and 
established that it has polynomial sample and computational complexities in the number of nodes 
p, number of mixture components r and cardinality of each node variable d. Our guarantees are 
applicable for a wide family of models. In future, we plan to investigate learning mixtures of 
continuous models, such as Gaussian mixture models. 
A Implementation of Spectral Decomposition Method

Overview of the algorithm: We provide the procedure in Algorithm 2. The algorithm computes the pairwise statistics of each node pair $a,b \in V \setminus \{u_*\}$, where $u_*$ is the reference node which is isolated in $\widehat{G}_\cup$, the union graph estimate obtained using Algorithm 1. The spectral decomposition is carried out on the triplet $\{u_*, c, (a,b); \{S = k\}\}$, where $c$ is any node not in the neighborhood of $a$ or $b$ in graph $\widehat{G}_\cup$. The set $S \subset V \setminus \{a,b,u_*\}$ separates $a,b$ from $c$ in $\widehat{G}_\cup$. See Figure 2. We fix the configuration of the separator set to $Y_S = k$, for each $k \in \mathcal{Y}^{|S|}$, and consider the empirical distribution of $n$ samples, $\widehat{P}^n(Y_{u_*}, Y_a, Y_b, Y_c, \{Y_S = k\})$. Upon spectral decomposition, we obtain the mixture components $\widehat{P}^{\mathrm{spect}}(Y_a, Y_b, Y_S | H = h)$ for $h \in [r]$. We can then employ the estimated pairwise marginals to find the Chow-Liu tree approximation $\{\widehat{T}_h\}_h$ for each mixture component. This routine can also be adapted to estimate the individual Markov graphs $\{G_h\}_h$ and is described briefly in Section A.1. Also, if the underlying model is a tree mixture, we can slightly modify the algorithm and obtain better guarantees, and we outline it in Section A.1.




Figure 2: By conditioning on the separator set $S$ on the union graph $G_\cup$, we have a mixture of product distributions with respect to nodes $\{u_*, c, (a,b)\}$, i.e., $Y_{u_*} \perp Y_c \perp Y_{a,b} \mid Y_S, H$.



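Algorithm 2 FindMixtureComponents($y^n, \widehat{G}; r$) for finding the tree-approximations of the components $\{P(y|H = h)\}_h$ of an $r$-component mixture using samples $y^n$ and graph $\widehat{G}$, which is an estimate of the graph $G_\cup := \cup_{h=1}^r G_h$ obtained using Algorithm 1.

  $\widehat{M}^n_{A,B,\{C;k\}} := [\widehat{P}^n(Y_A = i, Y_B = j, Y_C = k)]_{i,j}$ denotes the empirical joint probability matrix estimated using samples $y^n$, where $A \cap B \cap C = \emptyset$. Let $S(A,B;\widehat{G})$ be a minimal vertex separator separating $A$ and $B$ in graph $\widehat{G}$.
  Choose a uniformly random orthonormal basis $\{z_1, \ldots, z_r\} \in \mathbb{R}^r$. Let $Z \in \mathbb{R}^{r \times r}$ be a matrix whose $l$-th row is $z_l^\top$.
  Let $u_* \in V$ be isolated from all the other nodes in graph $\widehat{G}$. Otherwise declare fail.
  for $a, b \in V \setminus \{u_*\}$ do
    Let $c \notin \mathcal{N}(a;\widehat{G}) \cup \mathcal{N}(b;\widehat{G})$ (if no such node is found, go to the next node pair). $S \leftarrow S((a,b), c; \widehat{G})$.
    $\{\widehat{P}^{\mathrm{spect}}(Y_a, Y_b, Y_S | H = h)\}_h \leftarrow$ SpecDecom($u_*, c, (a,b); S, y^n, r, Z$).
  end for
  for $h \in [r]$ do
    $\widehat{T}_h, \{\widehat{P}^{\mathrm{tree}}(Y_a, Y_b | H = h)\}_{(a,b) \in \widehat{T}_h} \leftarrow$ ChowLiuTree($\{\widehat{P}^{\mathrm{spect}}(Y_a, Y_b | H = h)\}_{a,b \in V \setminus \{u_*\}}$).
  end for
  Output $\{\widehat{\pi}^{\mathrm{spect}}(h), \widehat{T}_h, \{\widehat{P}^{\mathrm{tree}}(Y_a, Y_b | H = h) : (a,b) \in \widehat{T}_h\}\}_{h \in [r]}$.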
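Procedure 3 $[\{\widehat{P}(Y_w, Y_S | H = h), \widehat{\pi}_H(h)\}_h] \leftarrow$ SpecDecom($u, v, w; S, y^n, r, Z$) for finding the components of an $r$-component mixture from $y^n$ samples at $w$, given witnesses $u, v$ and separator $S$ on graph $\widehat{G}$.

  Let $\widehat{M}^n_{u,v,\{S;k\}} := [\widehat{P}^n(Y_u = i, Y_v = j, Y_S = k)]_{i,j}$, where $\widehat{P}^n$ is the empirical distribution computed using samples $y^n$. Similarly, let $\widehat{M}^n_{u,v,\{S;k\},\{w;l\}} := [\widehat{P}^n(Y_u = i, Y_v = j, Y_S = k, Y_w = l)]_{i,j}$. For a vector $\lambda$, let $\mathrm{Diag}(\lambda)$ denote the corresponding diagonal matrix.
  for $k \in \mathcal{Y}^{|S|}$ do
    Choose $\widehat{U}_u$ as the set of top $r$ left orthonormal singular vectors of $\widehat{M}^n_{u,v,\{S;k\}}$ and $\widehat{V}_v$ as the right singular vectors. Similarly for node $w$, let $\widehat{U}_w$ be the top $r$ left orthonormal singular vectors.
    for $l \in [r]$ do
      $m_l \leftarrow \widehat{U}_w z_l$, $\widehat{A} \leftarrow \widehat{U}_u^\top \widehat{M}^n_{u,v,\{S;k\}} \widehat{V}_v$ and $\widehat{B}_l \leftarrow \widehat{U}_u^\top \left(\sum_q m_l(q)\, \widehat{M}^n_{u,v,\{S;k\},\{w;q\}}\right) \widehat{V}_v$.
      if $\widehat{A}$ is invertible (fail otherwise) then
        $\widehat{C}_l \leftarrow \widehat{B}_l \widehat{A}^{-1}$. $\mathrm{Diag}(\widehat{\lambda}^{(l)}) \leftarrow \widehat{R}^{-1} \widehat{C}_l \widehat{R}$. {Find $\widehat{R}$ which diagonalizes $\widehat{C}_1$ for the first triplet. Use the same matrix $\widehat{R}$ for all other triplets.}
      end if
    end for
    Form the matrix from the above eigenvalue computations: $\widehat{\Lambda} = [\widehat{\lambda}^{(1)} | \widehat{\lambda}^{(2)} | \cdots | \widehat{\lambda}^{(r)}]$.
    Obtain $\widehat{M}_{w|H,\{S;k\}} \leftarrow \widehat{U}_w Z^{-1} \widehat{\Lambda}^\top$. Similarly obtain $\widehat{M}_{v|H,\{S;k\}}$.
    Obtain $\widehat{\pi}_{H|\{S;k\}}$ from $\widehat{M}^n_{v,w,\{S;k\}} = \widehat{M}_{v|H,\{S;k\}}\, \mathrm{Diag}(\widehat{\pi}_{H|\{S;k\}})\, (\widehat{M}_{w|H,\{S;k\}})^\top\, \widehat{P}^n(Y_S = k)$.
  end for
  Output $\{\widehat{P}(Y_w, Y_S | H = h), \widehat{\pi}_H(h)\}_{h \in [r]}$.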



A.1 Discussion and Extensions

Simplification for Tree Mixtures ($G_h = T_h$): We can simplify the above method by limiting to tree approximations which are subgraphs of graph $G_\cup$. This procedure coincides with the original method when all the component Markov graphs $\{G_h\}_h$ are trees, i.e., $G_h = T_h$, $h \in [r]$. This is because in this case, the Chow-Liu tree coincides with $T_h \subset G_\cup$ (under exact statistics). This implies that we need to compute pairwise marginals only over the edges of $G_\cup$ using the SpecDecom routine, instead of over all the node pairs, and the ChowLiuTree procedure computes a maximum weighted spanning tree over $G_\cup$, instead of the complete graph. This leads to a slight improvement in sample complexity, and we note it in the remarks after Theorem 2.

Estimation of Component Markov Graphs $\{G_h\}_h$: We now note that we can also estimate the component Markov graphs $\{G_h\}$ using the spectral decomposition routines, and we briefly describe it below. Roughly, we can do a suitable conditional independence test on the estimated statistics $\widehat{P}^{\mathrm{spect}}(Y_{\mathcal{N}[a;\widehat{G}_\cup]} | H = h)$ obtained from spectral decomposition, for each node neighborhood $\mathcal{N}[a;\widehat{G}_\cup]$, where $a \in V \setminus \{u_*\}$ and $\widehat{G}_\cup$ is an estimate of $G_\cup := \cup_{h \in [r]} G_h$. We can estimate these statistics by selecting a suitable set of witnesses $C := \{c_1, c_2, \ldots\}$ such that $\mathcal{N}[a]$ can be separated from $C$ in $\widehat{G}_\cup$. We can employ Procedure SpecDecom on this configuration by using a suitable separator set and then doing a threshold test on the estimated component statistics $\widehat{P}^{\mathrm{spect}}$;
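Procedure 4 $[\widehat{T}, \{\widehat{P}^{\mathrm{tree}}(Y_a, Y_b)\}_{(a,b) \in \widehat{T}}] \leftarrow$ ChowLiuTree($\{\widehat{P}(Y_a, Y_b)\}_{a,b \in V \setminus \{u_*\}}$) for finding a tree approximation given the pairwise statistics.
  for $a, b \in V \setminus \{u_*\}$ do
    Compute mutual information $\widehat{I}(Y_a; Y_b)$ using $\widehat{P}(Y_a, Y_b)$.
  end for
  $\widehat{T} \leftarrow$ MaxWtTree($\{\widehat{I}(Y_a; Y_b)\}$) is the max-weight spanning tree using edge weights $\{\widehat{I}(Y_a; Y_b)\}$.
  for $(a,b) \in \widehat{T}$ do
    $\widehat{P}^{\mathrm{tree}}(Y_a, Y_b) \leftarrow \widehat{P}(Y_a, Y_b)$.
  end for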
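Procedure 4 can be sketched in a few lines of plain Python. The version below is an illustration, not the authors' implementation: it computes mutual information in nats from each pairwise table and builds the max-weight spanning tree with Kruskal's algorithm and a union-find structure. The function names and the input layout (a dictionary of joint tables keyed by ordered pairs `(a, b)` with `a < b`) are assumptions.

```python
import math
from itertools import combinations


def mutual_information(P):
    """I(Y_a; Y_b) in nats from a joint probability table P[i][j]."""
    pa = [sum(row) for row in P]
    pb = [sum(col) for col in zip(*P)]
    return sum(p * math.log(p / (pa[i] * pb[j]))
               for i, row in enumerate(P)
               for j, p in enumerate(row) if p > 0)


def chow_liu_tree(pairwise, nodes):
    """Max-weight spanning tree over `nodes` (a sorted list) with
    mutual-information edge weights; pairwise[(a, b)] is the table for a < b."""
    weights = sorted(((mutual_information(pairwise[a, b]), a, b)
                      for a, b in combinations(nodes, 2)), reverse=True)
    parent = {v: v for v in nodes}

    def find(x):  # union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _, a, b in weights:  # Kruskal: add heaviest edge that joins two parts
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append((a, b))
    return tree
```

In the mixture setting this routine would be called once per component $h$, with the estimated pairwise marginals of that component as input.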



if for each $(a,b) \in \widehat{G}_\cup$, the following quantity

^in\\P'P-^\Ya\Yb = A:, Y^(,)\fc =y,H = h)- P'^'^^'iY^lYb = l,Y^^a)\b = y,H = h)\\„ 

is below a certain threshold, for some $y \in \mathcal{Y}^{|\mathcal{N}(a) \setminus b|}$, then it is removed from $\widehat{G}_\cup$, and we obtain $\widehat{G}_h$ in this manner. A similar test was used for graphical model selection (i.e., not a mixture model) in (Bresler et al., 2008). We note that we can obtain sample complexity results for the above test, on lines of the analysis in Section 4.1, and this method is efficient when the maximum degree in $G_\cup$ is small.

B Extension to Graphs with Sparse Local Separators 

B.1 Graphs with Sparse Local Separators

We now extend the analysis to the setting where the graphical model mixture has the union graph $G_\cup$ with sparse local separators, which is a weaker criterion than having sparse exact separators. We now provide the definition of a local separator. For detailed discussion, refer to (Anandkumar et al., 2012d).

For $\gamma \in \mathbb{N}$, let $B_\gamma(i; G)$ denote the set of vertices within distance $\gamma$ from $i$ with respect to graph $G$. Let $F_{\gamma,i} := G(B_\gamma(i))$ denote the subgraph of $G$ spanned by $B_\gamma(i; G)$, but in addition, we retain the nodes not in $B_\gamma(i)$ (and remove the corresponding edges).

Definition 1 ($\gamma$-Local Separator) Given a graph $G$, a $\gamma$-local separator $S_{\mathrm{local}}(i,j;G,\gamma)$ between $i$ and $j$, for $(i,j) \notin G$, is a minimal vertex separator with respect to the subgraph $F_{\gamma,i}$. In addition, the parameter $\gamma$ is referred to as the path threshold for local separation. A graph is said to be $\eta$-locally separable, if

$\max_{(i,j) \notin G} |S_{\mathrm{local}}(i,j;G,\gamma)| \le \eta.$ (22)

A wide family of graphs possess the above property of sparse local separation, i.e., have a small $\eta$. In addition to graphs considered in the previous section, this includes the family of locally tree-like graphs (including sparse random graphs), bounded degree graphs, and augmented graphs, formed by the union of a bounded degree graph and a locally tree-like graph (e.g., small-world graphs). For detailed discussion, refer to (Anandkumar et al., 2012d).

Footnote: A minimal separator is a separator of smallest cardinality.
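Definition 1 can be illustrated on small graphs with a brute-force search. The sketch below builds the ball $B_\gamma(i;G)$ by breadth-first search, keeps only edges with both endpoints inside the ball (this stands in for $F_{\gamma,i}$), and then tries candidate separators in order of increasing size. The function names and the adjacency format (a dictionary mapping each node to a set of neighbors) are assumptions, and the exhaustive search is only practical for tiny examples.

```python
from collections import deque
from itertools import combinations


def ball(adj, i, gamma):
    """Vertices within graph distance gamma of node i."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        x = queue.popleft()
        if dist[x] == gamma:
            continue
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return set(dist)


def connected(adj, i, j, removed):
    """True if j is reachable from i after deleting the nodes in `removed`."""
    seen, stack = {i}, [i]
    while stack:
        x = stack.pop()
        if x == j:
            return True
        for y in adj[x]:
            if y not in seen and y not in removed:
                seen.add(y)
                stack.append(y)
    return False


def local_separator(adj, i, j, gamma):
    """Smallest node set separating non-neighbors i and j in the local subgraph."""
    B = ball(adj, i, gamma)
    # Keep all nodes, but only the edges with both endpoints inside the ball.
    local = {x: ({y for y in adj[x] if y in B} if x in B else set()) for x in adj}
    candidates = sorted(B - {i, j})
    for size in range(len(candidates) + 1):
        for S in combinations(candidates, size):
            if not connected(local, i, j, set(S)):
                return set(S)
```

On a six-node cycle, nodes 0 and 2 need two nodes to be separated exactly, but with path threshold $\gamma = 2$ only the short path through node 1 lies inside the ball, so the local separator has size one. This is the sense in which local separation is a weaker requirement than exact separation.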
B.2 Regime of Correlation Decay

We consider learning mixtures of graphical models Markov on graphs with sparse local separators. We assume that these models are in the regime of correlation decay, which makes learning feasible via our proposed methods. Technically, correlation decay can be defined in multiple ways, and we use the notion of strong spatial mixing (Weitz, 2006). A weaker notion is known as weak spatial mixing.

A graphical model is said to satisfy weak spatial mixing when the conditional distribution at each node $v$ is asymptotically independent of the configuration of a growing boundary (with respect to $v$). It is said to satisfy strong spatial mixing when the total variation distance between two conditional distributions at each node $v$, due to conditioning on different configurations, depends only on the graph distance between node $v$ and the set where the two configurations differ. We formally define it below and incorporate it to provide learning guarantees. See (Weitz, 2006) for details.

Let $P(Y_v | Y_A; G)$ denote the conditional distribution of node $v$ given a set $A \subset V \setminus \{v\}$ under model $P$ with Markov graph $G$. For some subgraph $F \subset G$, let $P(Y_v | Y_A; F)$ denote the conditional distribution corresponding to a graphical model Markov on subgraph $F$ instead of $G$, i.e., obtained by setting the potentials of edges (and hyperedges) in $G \setminus F$ to zero. For any two sets $A_1, A_2 \subset V$, let $\mathrm{dist}(A_1, A_2) := \min_{u \in A_1, v \in A_2} \mathrm{dist}(u,v)$ denote the minimum graph distance. Let $B_l(v)$ denote the set of nodes within graph distance $l$ from node $v$ and $\partial B_l(v)$ denote the boundary nodes, i.e., those exactly at distance $l$ from node $v$. Let $F_l(v;G) := G(B_l(v))$ denote the induced subgraph on $B_l(v;G)$. For any vectors $a, b$, let $\|a - b\|_1 := \sum_i |a(i) - b(i)|$ denote the $\ell_1$ distance between them.

Definition 2 (Correlation Decay) A graphical model $P$ Markov on graph $G = (V, E)$ with $p$ nodes is said to exhibit correlation decay with a non-increasing rate function $\zeta(\cdot) \ge 0$ if for all $l, p \in \mathbb{N}$,

$\max_{A \subset V \setminus \{v\}} \|P(Y_v | Y_A = y_A; G) - P(Y_v | Y_A = y_A; F_l(v;G))\|_1 = \zeta(\mathrm{dist}(A, \partial B_l(v))).$ (23)

Remarks: 



1. In (23), if we consider the marginal distribution of node $v$ instead of its conditional distribution over all sets $A$, then we have a weaker criterion, typically referred to as weak spatial mixing. However, in order to provide learning guarantees, we require the notion of strong mixing.

2. For the class of Ising models (binary variables), the regime of correlation decay can be explicitly characterized in terms of the maximum edge potential of the model. When the maximum edge potential is below a certain threshold, the model is said to be in the regime of correlation decay. The threshold can be explicitly characterized for certain graph families. See (Anandkumar et al., 2012d) for derivations.



Footnote: We slightly modify the definition of correlation decay compared to the usual notion by considering models on
different graphs, where one is an induced subgraph of the neighborhood of the other graph, instead of models with 
different boundary conditions. 
B.3 Rank Test Under Local Separation

We now provide sufficient conditions for the success of RankTest($y^n; \xi_{n,p}, \eta, r$) in Algorithm 1. Note
that the crucial difference compared to the previous section is that $\eta$ refers to the bound on local
separators, in contrast to the bound on exact separators. This can lead to a significant reduction in
the computational complexity of running the rank test for many graph families, since the complexity
scales as 0{p'^'^'^d?) where p is the number of nodes and d is the cardinality of each node variable. 

(B1) Number of Mixture Components: The number of components $r$ of the mixture model and dimension $d$ of each node variable satisfy

$d > r.$ (24)

The mixing weights of the latent factor $H$ are assumed to be strictly positive:

$\pi_H(h) := P(H = h) > 0, \quad \forall h \in [r].$

(B2) Constraints on Graph Structure: Recall that $G_\cup = \cup_{h=1}^r G_h$ denotes the union of the Markov graphs of the mixture components, and we assume that $G_\cup$ is $\eta$-locally separable according to Definition 1, i.e., for the chosen path threshold $\gamma \in \mathbb{N}$, we assume that

$|S_{\mathrm{local}}(u,v;G_\cup,\gamma)| \le \eta = O(1), \quad \forall (u,v) \notin G_\cup.$

(B3) Rank Condition: We assume that the matrix $M_{u,v,\{S;k\}}$ used in the rank test has rank strictly greater than $r$ when the nodes $u$ and $v$ are neighbors in graph $G_\cup = \cup_{h=1}^r G_h$ and the set satisfies $|S| \le \eta$. Let $\rho_{\min}$ denote

$\rho_{\min} := \min_{(u,v) \in G_\cup,\ |S| \le \eta,\ S \subset V \setminus \{u,v\}} \ \max_{k \in \mathcal{Y}^{|S|}} \sigma_{r+1}\left(M_{u,v,\{S;k\}}\right) > 0.$ (25)

(B4) Regime of Correlation Decay: We assume that all the mixture components $\{P(y | H = h; G_h)\}_{h \in [r]}$ are in the regime of correlation decay according to Definition 2 with rate functions $\{\zeta_h(\cdot)\}_{h \in [r]}$. Let

$\zeta(\gamma) := 2\sqrt{d}\, \max_{h \in [r]} \zeta_h(\gamma).$ (26)

We assume that the minimum singular value $\rho_{\min}$ in (25) and $\zeta(\gamma)$ above satisfy $\rho_{\min} > \zeta(\gamma)$.
(B5) Choice of threshold $\xi$: For RankTest in Algorithm 1, the threshold $\xi$ is chosen as

$\xi := \frac{\rho_{\min} - \zeta(\gamma)}{2} > 0,$

where $\zeta(\gamma)$ is given by (26), $\rho_{\min}$ is given by (25), and $\gamma$ is the path threshold for local separation on graph $G_\cup$.

(B6) Number of Samples: Given $\delta > 0$, the number of samples $n$ satisfies

$n > n_{\mathrm{LRank}}(\delta; p) := \max\left( \frac{1}{t^2}\left(2\log p + \log \delta^{-1} + \log 2\right), \left(\frac{2}{\rho_{\min} - \zeta(\gamma) - t}\right)^2 \right),$ (27)

where $p$ is the number of nodes, for some $t \in (0, \rho_{\min} - \zeta(\gamma))$.
The above assumptions (B1)-(B6) are comparable to assumptions (A1)-(A5) in Section 3.1.1. The conditions on $r$ and $d$ in (A1) and (B1) are identical. The conditions (A2) and (B2) are comparable, with the only difference being that (A2) assumes a bound on exact separators while (B2) assumes a bound on local separators, which is a weaker criterion. Again, the conditions (A3) and (B3) on the rank of matrices for neighboring nodes are identical. The condition (B4) is an additional condition regarding the presence of correlation decay in the mixture components. This assumption is required for approximate conditional independence under conditioning with local separator sets in each mixture component. In addition, we require that $\zeta(\gamma) < \rho_{\min}$. In other words, the threshold $\gamma$ on path lengths considered for local separation should be large enough (so that the corresponding value $\zeta(\gamma)$ is small). (B5) provides a modified threshold to account for distortion due to the use of local separators, and (B6) provides the modified sample complexity.

B.3.1 Success of Rank Tests 

We now provide the result on the success of recovering the union graph $G_\cup := \cup_{h=1}^r G_h$ for $\eta$-locally separable graphs.

Theorem 4 (Success of Rank Tests) The RankTest($y^n; \xi, \eta, r$) outputs the correct graph $G_\cup := \cup_{h=1}^r G_h$, which is the union of the component Markov graphs, under the assumptions (B1)-(B6) with probability at least $1 - \delta$.

Proof: See Appendix C. □

B.4 Results for Spectral Decomposition Under Local Separation 

The FindMixtureComponents($y^n, \widehat{G}; r$) procedure in Algorithm 2 can also be implemented for graphs with local separators, but with the modification that we use local separators $S_{\mathrm{local}}((a,b), c; \widehat{G})$, as opposed to exact separators, between nodes $a, b$ and $c$ under consideration. We prove that this method succeeds in estimating the pairwise marginals of the component model under the following set of conditions. We find that there is additional distortion introduced due to the use of local separators in FindMixtureComponents as opposed to exact separators.

B.4.1 Assumptions 

In addition to the assumptions (B1)-(B6), we impose the following constraints to guarantee the 
success of estimating the various mixture components. 

(B7) Full Rank Views of the Latent Factor: For each node pair $a,b \in V$, and any subset $S \subset V \setminus \{a,b\}$ with $|S| \le 2\eta$ and $k \in \mathcal{Y}^{|S|}$, the probability matrix $M_{(a,b)|H,\{S;k\}} := [P(Y_{a,b} = i | H = j, Y_S = k)]_{i,j} \in \mathbb{R}^{d^2 \times r}$ has rank $r$.

(B8) Existence of an Isolated Node: There exists a node $u_* \in V$ which is isolated from all other nodes in $G_\cup = \cup_{h=1}^r G_h$, i.e.,

$Y_{u_*} \perp Y_{V \setminus u_*} \mid H.$ (28)

(B9) Spectral Bounds and Random Rotation Matrix: Refer to the various spectral bounds used to obtain $K(\delta;p,d,r)$ in Appendix D.3, where $\delta \in (0,1)$ is fixed. Further assume that the rotation matrix $Z \in \mathbb{R}^{r \times r}$ in FindMixtureComponents is chosen uniformly over the Stiefel manifold $\{Q \in \mathbb{R}^{r \times r} : Q^\top Q = I\}$.
(B10) Number of Samples: For fixed $\delta \in (0,1)$ and $\epsilon > \epsilon_0$, the number of samples satisfies

n > niocai-spectio, e;p,d,r) := — -2 — , (29) 

(e - eo) 

where

$\epsilon_0 := 2K'(\delta;p,d,r)\,\zeta(\gamma),$ (30)

and $K'(\delta;p,d,r)$ and $K(\delta;p,d,r)$ are defined in (57) and (58), and $\zeta(\gamma)$ is given by (26).

The assumptions (B7)-(B9) are identical to (A6)-(A8). In (B10), the bound on the number of
samples is slightly worse than in (A9), depending on the correlation decay rate function $\zeta(\gamma)$.
Moreover, the perturbation $\epsilon$ now has a lower bound $\epsilon_0$ in (30), due to the use of local separators
in contrast to exact vertex separators. As before, we impose additional conditions below in order
to obtain the correct Chow-Liu tree approximation $T_h$ of each mixture component $P(\mathbf{y}|H = h)$.

(B11) Separation of Mutual Information: Let $T_h$ denote the Chow-Liu tree corresponding to
the model $P(\mathbf{y}|H = h)$ when exact statistics are input, and let

$$\vartheta := \min_{h \in [r]}\ \min_{(a,b) \notin T_h}\ \min_{(u,v) \in \text{Path}(a,b;T_h)} \big( I(Y_u, Y_v | H = h) - I(Y_a, Y_b | H = h) \big), \tag{31}$$

where $\text{Path}(a, b; T_h)$ denotes the edges along the path connecting $a$ and $b$ in $T_h$.

(B12) Constraint on Distortion: For the function $\phi(\cdot)$ defined in (66) in Appendix D.5, and for
some $\tau \in (0, 0.5\vartheta)$, let $\epsilon^{\text{tree}} := \phi^{-1}\left(\frac{0.5\vartheta - \tau}{3d}\right) > \epsilon_0$, where $\epsilon_0$ is given by (30). The number of
samples is now required to satisfy

$$n > n_{\text{local-spect}}(\delta, \epsilon^{\text{tree}}; p, d, r), \tag{32}$$

where $n_{\text{local-spect}}$ is given by (29).

Conditions (B11) and (B12) are identical to (A10) and (A11), except that the bound
$\epsilon^{\text{tree}}$ in (B12) is required to be above the lower bound $\epsilon_0$ in (30).

B.4.2 Guarantees for Learning Mixture Components 

We now provide the result on the success of recovering the tree approximation $T_h$ of each mixture
component $P(\mathbf{y}|H = h)$ under local separation.

Theorem 5 (Guarantees for FindMixtureComponents) Under the assumptions (B1)-(B10), the
procedure in Algorithm 2 outputs $\hat P^{\text{spect}}(Y_a, Y_b | H = h)$, for $a, b \in V \setminus \{u_*\}$, with probability at least
$1 - 4\delta$, such that for all $h \in [r]$, there exists a permutation $\tau(h) \in [r]$ with

$$\|\hat P^{\text{spect}}(Y_a, Y_b | H = h) - P(Y_a, Y_b | H = \tau(h))\|_2 \le \epsilon. \tag{33}$$

Moreover, under the additional assumptions (B11)-(B12), the method outputs the correct Chow-Liu
tree $T_h$ of each component $P(\mathbf{y}|H = h)$ with probability at least $1 - 4\delta$.

Remark: The sample and computational complexities are significantly improved, since they depend
only on the size of local separators (while previously they depended on the size of exact separators).



Footnote to (B11): Assume that the Chow-Liu tree $T_h$ is unique for each component $h \in [r]$ under exact statistics; this holds
for generic parameters.

C Analysis of Rank Test: Proof of Theorems 1 and 4

Bounds on Empirical Probability: We first recap the result from (Hsu et al., 2009, Proposition 19),
which is an application of McDiarmid's inequality. Let $\|\cdot\|_2$ be the $\ell_2$ norm of a vector.

Proposition 1 (Bound for Empirical Probability Estimates) Given empirical estimates $\hat P^n$
of a probability vector $P$ using $n$ i.i.d. samples, we have

$$P\big[\|\hat P^n - P\|_2 > \epsilon\big] \le \exp\big[-n(\epsilon - 1/\sqrt{n})^2\big], \quad \forall \epsilon > 1/\sqrt{n}. \tag{34}$$



Remark: The bound is independent of the cardinality of the sample space. 
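
As a numerical illustration of Proposition 1 (a minimal sketch, not part of the original analysis), the bound (34) can be inverted to give a sufficient sample size: since $\exp[-n(\epsilon - 1/\sqrt{n})^2] = \exp[-(\sqrt{n}\epsilon - 1)^2]$, requiring the right-hand side to be at most $\delta$ gives $n \ge ((1 + \sqrt{\ln(1/\delta)})/\epsilon)^2$. The function names `tail_bound` and `samples_needed`, and the values of $\epsilon$ and $\delta$, are our own illustrative choices.

```python
import math

def tail_bound(n: int, eps: float) -> float:
    """Right-hand side of (34): exp(-n (eps - 1/sqrt(n))^2), valid for eps > 1/sqrt(n)."""
    if eps <= 1.0 / math.sqrt(n):
        raise ValueError("bound (34) requires eps > 1/sqrt(n)")
    return math.exp(-n * (eps - 1.0 / math.sqrt(n)) ** 2)

def samples_needed(eps: float, delta: float) -> int:
    """A sample size n for which tail_bound(n, eps) <= delta.

    Solving exp(-(sqrt(n) * eps - 1)^2) <= delta for n gives
    n >= ((1 + sqrt(ln(1/delta))) / eps)^2.
    """
    root = (1.0 + math.sqrt(math.log(1.0 / delta))) / eps
    return math.ceil(root ** 2)

# Illustrative values only: accuracy 0.05 with failure probability 0.01.
n = samples_needed(eps=0.05, delta=0.01)
print(n, tail_bound(n, 0.05))
```

Note that the required $n$ does not depend on the dimension of the probability vector, in line with the remark above.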

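This implies concentration bounds for $M_{u,v,\{S;k\}}$. Let $\|\cdot\|_2$ and $\|\cdot\|_F$ denote the spectral norm
and the Frobenius norm, respectively.

Lemma 3 (Bounds for $\hat M^n_{u,v,\{S;k\}}$) Given $n$ i.i.d. samples $\mathbf{y}^n$, the empirical estimate
$\hat M^n_{u,v,\{S;k\}} := [\hat P^n[Y_u = i, Y_v = j, Y_S = k]]_{i,j}$ satisfies

$$P\Big[\max_{l} \big|\sigma_l(\hat M^n_{u,v,\{S;k\}}) - \sigma_l(M_{u,v,\{S;k\}})\big| > \epsilon\Big] \le \exp\big[-n(\epsilon - 1/\sqrt{n})^2\big], \quad \forall \epsilon > 1/\sqrt{n}. \tag{35}$$

Proof: Using Proposition 1, we have

$$P\Big[\max_{k} \|\hat P^n(Y_u, Y_v, Y_S = k) - P(Y_u, Y_v, Y_S = k)\|_2 > \epsilon\Big] \le \exp\big[-n(\epsilon - 1/\sqrt{n})^2\big], \quad \epsilon > 1/\sqrt{n}. \tag{36}$$

In other words,

$$P\Big[\max_{k} \|\hat M^n_{u,v,\{S;k\}} - M_{u,v,\{S;k\}}\|_F > \epsilon\Big] \le \exp\big[-n(\epsilon - 1/\sqrt{n})^2\big], \quad \epsilon > 1/\sqrt{n}. \tag{37}$$

Since $\|A\|_2 \le \|A\|_F$ for any matrix $A$, applying Weyl's theorem gives the result. □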

From Lemma 1 and Lemma 3, it is easy to see that

P[GS / Gu] < 2/ exp [-n [p^j2 - l/^f 

and we have the result. Similarly, we have Theorem 5 from Lemma 11 and Lemma 3. □



D Analysis of Spectral Decomposition: Proof of Theorem 2

D.1 Analysis Under Exact Statistics

We now prove the success of FindMixtureComponents under exact statistics. We first consider three
sets $A_1, A_2, A_3 \subset V$ such that $\mathcal{N}[A_i; G_\cup] \cap \mathcal{N}[A_j; G_\cup] = \emptyset$ for $i \ne j \in [3]$, where $G_\cup := \cup_{h \in [r]} G_h$ is the
union of the Markov graphs. Let $S \subset V \setminus \cup_i A_i$ be a multiway separator set for $A_1, A_2, A_3$ in graph
$G_\cup$. For $A_i$, $i \in \{1, 2, 3\}$, let $U_i \in \mathbb{R}^{d^{|A_i|} \times r}$ be a matrix such that $U_i^\top M_{A_i|H,\{S;k\}}$ is invertible, for a
fixed $k \in \mathcal{Y}^{|S|}$. Then $U_1^\top M_{A_1,A_2,\{S;k\}} U_2$ is invertible, and for all $m \in \mathbb{R}^{d^{|A_3|}}$, the observable operator
$C(m) \in \mathbb{R}^{r \times r}$ is given by

$$C(m) := \Big(U_1^\top \Big(\sum_q m(q)\, M_{A_1,A_2,\{S;k\},\{A_3;q\}}\Big) U_2\Big) \big(U_1^\top M_{A_1,A_2,\{S;k\}} U_2\big)^{-1}. \tag{38}$$

Note that the above operator is computed in the SpecDecom procedure. We now provide a generalization
of the result in (Anandkumar et al., 2012b).



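Lemma 4 (Observable Operator) Under assumption (A6), the observable operator in (38) satisfies

$$C(m) = \big(U_1^\top M_{A_1|H,\{S;k\}}\big)\, \mathrm{Diag}\big(M_{A_3|H,\{S;k\}}^\top m\big)\, \big(U_1^\top M_{A_1|H,\{S;k\}}\big)^{-1}. \tag{39}$$

In particular, the $r$ roots of the polynomial $\lambda \mapsto \det(C(m) - \lambda I)$ are $\{\langle m, M_{A_3|H,\{S;k\}} e_j \rangle : j \in [r]\}$.

Proof: We have

$$U_1^\top M_{A_1,A_2,\{S;k\}} U_2 = \big(U_1^\top M_{A_1|H,\{S;k\}}\big)\, \mathrm{Diag}(\pi_{H,\{S;k\}})\, \big(M_{A_2|H,\{S;k\}}^\top U_2\big)$$

on lines of (8), which is invertible by the assumptions on $U_1$, $U_2$ and Assumption (A6). Similarly,

$$U_1^\top \Big(\sum_q m(q)\, M_{A_1,A_2,\{S;k\},\{A_3;q\}}\Big) U_2 = \big(U_1^\top M_{A_1|H,\{S;k\}}\big)\, \mathrm{Diag}\big(M_{A_3|H,\{S;k\}}^\top m\big)\, \mathrm{Diag}(\pi_{H,\{S;k\}})\, \big(M_{A_2|H,\{S;k\}}^\top U_2\big),$$

and we have the result. □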
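
The eigenvalue claim of Lemma 4 can be checked numerically on a toy model. The following is a minimal sketch under simplifying assumptions of our own: $r = 2$ hidden states, three binary views that are conditionally independent given $H$, an empty separator set, and $U_1 = U_2 = I$. All numerical values and helper names (`joint`, `inv2`, `matmul2`) are hypothetical.

```python
import math

# Hypothetical toy mixture: r = 2 hidden states, three binary views that are
# conditionally independent given H (empty separator set, U_1 = U_2 = I).
pi = [0.4, 0.6]                       # P(H = h)
M1 = [[0.9, 0.2], [0.1, 0.8]]         # M1[i][h] = P(Y_1 = i | H = h)
M2 = [[0.7, 0.3], [0.3, 0.7]]
M3 = [[0.6, 0.1], [0.4, 0.9]]
m = [1.0, -0.5]                       # arbitrary test vector m

def joint(i, j, q):
    """P(Y_1 = i, Y_2 = j, Y_3 = q), marginalising over the hidden state."""
    return sum(pi[h] * M1[i][h] * M2[j][h] * M3[q][h] for h in range(2))

# Observable moments: the pair matrix and the m-weighted triple matrix.
P12 = [[sum(joint(i, j, q) for q in range(2)) for j in range(2)] for i in range(2)]
P123m = [[sum(m[q] * joint(i, j, q) for q in range(2)) for j in range(2)] for i in range(2)]

def inv2(A):
    """Inverse of a 2x2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]

def matmul2(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

C = matmul2(P123m, inv2(P12))         # observable operator C(m) as in (38)

# Eigenvalues of a 2x2 matrix from its trace and determinant.
tr = C[0][0] + C[1][1]
det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
disc = math.sqrt(tr * tr - 4.0 * det)
eigs = sorted([(tr - disc) / 2.0, (tr + disc) / 2.0])

# Lemma 4 predicts the eigenvalues <m, M3 e_j>, j = 1, 2.
expected = sorted(sum(m[q] * M3[q][h] for q in range(2)) for h in range(2))
print(eigs, expected)
```

Only observable moments enter `C`; the hidden-state parameters are used solely to generate those moments and to form the predicted eigenvalues for comparison.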

The above result implies that we can recover the matrix $M_{A_3|H,\{S;k\}}$ for any set $A_3 \subset V$, by
using a suitable reference node, a witness and a separator set. We set the isolated node $u_*$ as the
reference node (set $A_1$ in the above result). Since we focus on recovering the edge marginals of the
mixture components, we consider each node pair $a, b \in V \setminus \{u_*\}$ (set $A_3$ in the above result), and
any node $c \notin \mathcal{N}(a; G_\cup) \cup \mathcal{N}(b; G_\cup)$ (set $A_2$ in the above result), where $G_\cup := \cup_{h \in [r]} G_h$, as described
in FindMixtureComponents. Thus, we are able to recover $M_{a,b|H,\{S;k\}}$ under exact statistics. Since
$Y_S$ are observed, we have the knowledge of $P(Y_S = k)$, and can thus recover $M_{a,b|H}$ as desired.
The spectral decompositions of different groups are aligned since we use the same node $u_*$, and
since $u_*$ is isolated in $G_\cup$, fixing the variables $Y_S = k$ has no effect on the conditional distribution
of $Y_{u_*}$, i.e., $P(Y_{u_*} | H, Y_S = k) = P(Y_{u_*} | H)$. Since we recover the edge marginals $M_{a,b|H}$ correctly,
we can recover the correct tree approximation $T_h$, for $h \in [r]$.

D.2 Analysis of SpecDecom($u$, $v$, $w$; $S$)

We first consider the success of Procedure SpecDecom($u$, $v$, $w$; $S$) for estimating the statistics of $w$
using node $u \in V$ as the reference node (which is conditionally independent of all other nodes given
$H$), witness $v \in V$ and separator set $S$. We will use this to provide sample complexity results on
FindMixtureComponents using union bounds. The proof borrows heavily from (Anandkumar et al.,
2012b).

Recall that $\hat U_u$ is the set of top $r$ left orthonormal singular vectors of $\hat M^n_{u,v,\{S;k\}}$ and $\hat V_v$ is the set of
right orthonormal singular vectors. For $l \in [r]$, let $m_l = \hat U_w z_l$, where $z_l$ is uniformly distributed in $S^{r-1}$ and
$\hat U_w$ is the set of top $r$ left singular vectors of $\hat M^n_{w,u,\{S;k\}}$. By Lemma 13, we have that $\hat U_u^\top M_{u,v,\{S;k\}} \hat V_v$ is
invertible. Recall the definition of the observable operator in (38):

$$C_l := C(m_l) = \hat U_u^\top \Big(\sum_q m_l(q)\, M_{u,v,\{S;k\},\{w;q\}}\Big) \hat V_v \big(\hat U_u^\top M_{u,v,\{S;k\}} \hat V_v\big)^{-1}, \tag{40}$$

where exact matrices $M$ are used. Denote by $\hat C_l$ the operator obtained when the sample versions $\hat M^n$ are used:

$$\hat C_l := \hat U_u^\top \Big(\sum_q m_l(q)\, \hat M^n_{u,v,\{S;k\},\{w;q\}}\Big) \hat V_v \big(\hat U_u^\top \hat M^n_{u,v,\{S;k\}} \hat V_v\big)^{-1}. \tag{41}$$

We have the following result.

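Lemma 5 (Bounds for $\|C_l - \hat C_l\|_2$) The matrices $C_l$ and $\hat C_l$ defined in (40) and (41) satisfy

$$\|C_l - \hat C_l\|_2 \le \frac{2\,\big\|\sum_q m_l(q) \big(\hat M^n_{u,v,\{S;k\},\{w;q\}} - M_{u,v,\{S;k\},\{w;q\}}\big)\big\|_2}{\sigma_r(M_{u,v,\{S;k\}})} + \frac{4\,\big\|\sum_q m_l(q)\, M_{u,v,\{S;k\},\{w;q\}}\big\|_2\, \|\hat M^n_{u,v,\{S;k\}} - M_{u,v,\{S;k\}}\|_2}{\sigma_r(M_{u,v,\{S;k\}})^2}. \tag{42}$$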

Proof: Using Lemma [14] and Lemma [H □ 

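We now provide perturbation bounds between the estimated matrix $\hat M_{w|H,\{S;k\}}$ and the true matrix
$M_{w|H,\{S;k\}}$. Define

$$\beta(w) := \min_{k \in \mathcal{Y}^{|S|}}\ \min_{l \in [r]}\ \min_{j \ne j'} \big|\langle z_l, \hat U_w^\top M_{w|H,\{S;k\}} (e_j - e_{j'}) \rangle\big|, \tag{43}$$

$$\lambda_{\max}(w) := \max_{k \in \mathcal{Y}^{|S|},\ l, j \in [r]} \big|\langle z_l, \hat U_w^\top M_{w|H,\{S;k\}} e_j \rangle\big|, \tag{44}$$

where $z_l$ is uniformly distributed in $S^{r-1}$.

Lemma 6 (Relating $\hat M_{w|H,\{S;k\}}$ and $M_{w|H,\{S;k\}}$) The estimated matrix $\hat M_{w|H,\{S;k\}}$ using samples
and the true matrix $M_{w|H,\{S;k\}}$ satisfy, for all $j \in [r]$,

$$\|\hat M_{w|H,\{S;k\}} e_j - M_{w|H,\{S;k\}} e_{\tau(j)}\|_2 \le 2\,\|M_{w|H,\{S;k\}} e_{\tau(j)}\|_2 \cdot \frac{\|\hat M^n_{u,w,\{S;k\}} - M_{u,w,\{S;k\}}\|_2}{\sigma_r(M_{u,w,\{S;k\}})}$$
$$\qquad + \big(12\sqrt{r} \cdot \kappa(M_{u|H})^2 + 256 r^2 \cdot \kappa(M_{u|H})^4 \cdot \lambda_{\max}(w)/\beta(w)\big) \cdot \|\hat C_l - C_l\|_2. \tag{45}$$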

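Proof: Define a matrix $R := \hat U_u^\top M_{u|H}\, \mathrm{Diag}(\|\hat U_u^\top M_{u|H} e_1\|_2, \ldots, \|\hat U_u^\top M_{u|H} e_r\|_2)^{-1}$. Note that $R$
has unit norm columns and $R$ diagonalizes $C_l$, i.e.,

$$R^{-1} C_l R = \mathrm{Diag}\big(M_{w|H,\{S;k\}}^\top m_l\big).$$

Using the fact that for any stochastic $d \times r$ matrix $A$, $\|A\|_2 \le \sqrt{r}\,\|A\|_1 = \sqrt{r}$, and Lemma 17,
we have

$$\|R^{-1}\|_2 \le 2\kappa(M_{u|H}), \qquad \kappa(R) \le 4\kappa(M_{u|H})^2.$$

From above and by Lemma 16, there exists a permutation $\tau$ on $[r]$ such that, for all $j, l \in [r]$,

$$|\hat\lambda^{(l)}(j) - \lambda^{(l)}(\tau(j))| \le \big(3\kappa(R) + 16 r^{1.5} \cdot \kappa(R) \cdot \|R^{-1}\|_2^2 \cdot \lambda_{\max}(w)/\beta(w)\big) \cdot \|\hat C_l - C_l\|_2$$
$$\qquad \le \big(12\kappa(M_{u|H})^2 + 256 r^{1.5} \cdot \kappa(M_{u|H})^4 \cdot \lambda_{\max}(w)/\beta(w)\big) \cdot \|\hat C_l - C_l\|_2, \tag{46}$$

where $\beta(w)$ and $\lambda_{\max}(w)$ are given by (43) and (44). Let $\hat\nu^{(j)} := (\hat\lambda^{(1)}(j), \hat\lambda^{(2)}(j), \ldots, \hat\lambda^{(r)}(j)) \in \mathbb{R}^r$
be the row vector corresponding to the $j$-th row of $\hat\Lambda$, and $\nu^{(j)} := (\lambda^{(1)}(j), \lambda^{(2)}(j), \ldots, \lambda^{(r)}(j)) \in \mathbb{R}^r$.
Observe that $\nu^{(j)} = Z \hat U_w^\top M_{w|H,\{S;k\}} e_j$. By the orthogonality of $Z$, the fact that $\|v\|_2 \le \sqrt{r}\,\|v\|_\infty$
for $v \in \mathbb{R}^r$, and the above inequality,

$$\|Z^{-1} \hat\nu^{(j)} - \hat U_w^\top M_{w|H,\{S;k\}} e_{\tau(j)}\|_2 = \|Z^{-1}(\hat\nu^{(j)} - \nu^{(\tau(j))})\|_2$$
$$\qquad = \|\hat\nu^{(j)} - \nu^{(\tau(j))}\|_2$$
$$\qquad \le \sqrt{r} \cdot \|\hat\nu^{(j)} - \nu^{(\tau(j))}\|_\infty$$
$$\qquad \le \big(12\sqrt{r} \cdot \kappa(M_{u|H})^2 + 256 r^2 \cdot \kappa(M_{u|H})^4 \cdot \lambda_{\max}(w)/\beta(w)\big) \cdot \|\hat C_l - C_l\|_2.$$

By Lemma 13 (as applied to $\hat M^n_{u,w,\{S;k\}}$ and $M_{u,w,\{S;k\}}$), we have

$$\|\hat M_{w|H,\{S;k\}} e_j - M_{w|H,\{S;k\}} e_{\tau(j)}\|_2 \le \|Z^{-1} \hat\nu^{(j)} - \hat U_w^\top M_{w|H,\{S;k\}} e_{\tau(j)}\|_2 + 2\,\|M_{w|H,\{S;k\}} e_{\tau(j)}\|_2 \cdot \frac{\|\hat M^n_{u,w,\{S;k\}} - M_{u,w,\{S;k\}}\|_2}{\sigma_r(M_{u,w,\{S;k\}})}.$$

□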

D.3 Analysis of FindMixtureComponents 

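We now provide results for Procedure FindMixtureComponents by using the previous result, where
$w$ is set to each node pair $a, b \in V \setminus \{u_*\}$. We condition on the event that $\hat G_\cup = G_\cup$, where
$G_\cup := \cup_{h \in [r]} G_h$ is the union of the component graphs.

We now give concentration bounds for $\beta$ and $\lambda_{\max}$ in (43) and (44). Define

$$\alpha_{\min} := \min_{a,b \in V \setminus \{u_*\}}\ \min_{\substack{k \in \mathcal{Y}^{|S|},\ |S| \le 2\eta \\ S \subset V \setminus \{a,b,u_*\}}}\ \min_{i \ne i'} \|M_{(a,b)|H,\{S;k\}} (e_i - e_{i'})\|_2, \tag{48}$$

$$\alpha_{\max} := \max_{a,b \in V \setminus \{u_*\}}\ \max_{\substack{k \in \mathcal{Y}^{|S|},\ |S| \le 2\eta \\ S \subset V \setminus \{a,b,u_*\}}}\ \max_{j \in [r]} \|M_{(a,b)|H,\{S;k\}} e_j\|_2, \tag{49}$$

and let

$$\alpha := \frac{\alpha_{\max}}{\alpha_{\min}}. \tag{50}$$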

Lemma 7 (Bounds for $\beta$ and $\lambda_{\max}$) Fix $\delta \in (0, 1)$. Given any $a, b \in V \setminus \{u_*\}$ and any set $S \subset
V \setminus \{a, b, u_*\}$ with $|S| \le 2\eta$, we have with probability at least $1 - \delta$,

$$\lambda_{\max}(a, b) \le \frac{\alpha_{\max}}{\sqrt{r}} \Big(1 + \sqrt{2 \ln\big(r^2 p^2 (pd)^{2\eta}/\delta\big)}\Big). \tag{52}$$

This implies that with probability at least $1 - 2\delta$,

^^^ > ^rVipd^ (l + V21n(rVM^V'^)) , (53) 

where $\alpha$ is given by (50).

Similarly, we have bounds on $\|\hat M^n_{u_*,a,b,\{S;k\}} - M_{u_*,a,b,\{S;k\}}\|_2$ using Lemma 3 and a union bound.

Proposition 2 ($\|\hat M^n_{u_*,a,b,\{S;k\}} - M_{u_*,a,b,\{S;k\}}\|_2$) With probability at least $1 - \delta$, we have, for all
$a, b \in V \setminus \{u_*\}$, $S \subset V \setminus \{a, b, u_*\}$, $|S| \le 2\eta$,



\MZ,aMS;k} - Mu,,aMS;k}h ^^i^ + \/l°g (^^^) j . (54) 



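Define $\rho'_{1,\min}$, $\rho'_{2,\min}$ and $\rho'_{\max}$ as

$$\rho'_{1,\min} := \min_{\substack{S \subset V \setminus \{u_*, v\} \\ |S| \le 2\eta,\ k \in \mathcal{Y}^{|S|}}}\ \min_{v \in V \setminus \{u_*\}} \sigma_r\big(M_{u_*,v,\{S;k\}}\big), \tag{55}$$

$$\rho'_{2,\min} := \min_{\substack{S \subset V \setminus \{u_*, a, b\} \\ |S| \le 2\eta,\ k \in \mathcal{Y}^{|S|}}}\ \min_{a,b \in V \setminus \{u_*\}} \sigma_r\big(M_{u_*,a,b,\{S;k\}}\big). \tag{56}$$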



Using the above defined constants, define 

K'{6;p, d, r) :=1024 • k{M^ih)^ ■ .^:(^r^p\pdf'^ (l + ^j2\n{r^p'^{pdY^ /5) 

+ 48^.k(M„|^)2 + ^, (57) 

"l^min "2,111111 

and 



K{5- p, d,r):= K'{6; p,d,r)(l + Jlog f^!!^lfl\ j . (53) 

We can now provide the final bound on the distortion of the estimated statistics using all the previous
results.

Lemma 8 (Bounds for $\|\hat M_{a,b|H,\{S;k\}} e_j - M_{a,b|H,\{S;k\}} e_{\tau(j)}\|_2$) For any $a, b \in V \setminus \{u_*\}$, $k \in \mathcal{Y}^{|S|}$,
$j \in [r]$, there exists a permutation $\tau(j) \in [r]$ such that, conditioned on the event that $\hat G_\cup = G_\cup$, with
probability at least $1 - 3\delta$,

$$\|\hat M_{a,b|H,\{S;k\}} e_j - M_{a,b|H,\{S;k\}} e_{\tau(j)}\|_2 \le \frac{K(\delta; p, d, r)}{\sqrt{n}}. \tag{59}$$

This implies

$$\|\hat M_{a,b|H} e_j - M_{a,b|H} e_{\tau(j)}\|_2 \le \frac{K(\delta; p, d, r)}{\sqrt{n}} + \frac{K(\delta; p, d, r)}{\sqrt{n}} = \frac{2K(\delta; p, d, r)}{\sqrt{n}}. \tag{60}$$

Results on Random Rotation Matrix: We also require the following result from (Anandkumar et al.,
2012b). The standard inner product between vectors $u$ and $v$ is denoted by $\langle u, v \rangle = u^\top v$. Let $\sigma_i(A)$
denote the $i$-th largest singular value of a matrix $A$. Let $S^{m-1} := \{u \in \mathbb{R}^m : \|u\|_2 = 1\}$ denote the
unit sphere in $\mathbb{R}^m$. Let $e_i$ denote the $i$-th coordinate vector, where the $i$-th entry is 1 and the
rest are zero.



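Lemma 9 Fix any $\delta \in (0, 1)$ and matrix $A \in \mathbb{R}^{m \times n}$ (with $m \le n$). Let $\theta \in \mathbb{R}^m$ be a random vector
distributed uniformly over $S^{m-1}$.

1. $\Pr\Big[\min_{i \ne j} |\langle \theta, A(e_i - e_j) \rangle| > \frac{\sqrt{2}\, \sigma_m(A) \cdot \delta}{\sqrt{em}\, \binom{n}{2}}\Big] \ge 1 - \delta.$

2. $\Pr\Big[\forall i \in [m],\ |\langle \theta, A e_i \rangle| \le \frac{\|A e_i\|_2}{\sqrt{m}} \big(1 + \sqrt{2 \ln(m/\delta)}\big)\Big] \ge 1 - \delta.$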



D.4 Improved Results for Tree Mixtures 

We now consider a simplified version of FindMixtureComponents by limiting to estimation of pairwise
marginals only on the edges of $\hat G_\cup$, where $\hat G_\cup$ is the estimate of $G_\cup := \cup_{h \in [r]} G_h$, which is the
union of the component graphs, as well as constructing the Chow-Liu trees $T_h$ as subgraphs of $\hat G_\cup$.
Thus, instead of considering each node pair $a, b \in V \setminus \{u_*\}$, we only need to choose $(a, b) \in \hat G_\cup$.
Moreover, instead of considering $S \subset V \setminus \{a, b, u_*\}$, we can follow the convention of choosing
$S \subset \mathcal{N}(a; \hat G_\cup) \cup \mathcal{N}(b; \hat G_\cup)$, and this changes the definition of $\alpha_{\min}$, $\alpha_{\max}$, $\rho'_{1,\min}$, $\rho'_{2,\min}$ and so on.
Let

$$\Delta_2 := \max_{(a,b) \in G_\cup} |\mathcal{N}(a; G_\cup) \cup \mathcal{N}(b; G_\cup)|. \tag{61}$$

We have improved bounds for $\beta$ and $\lambda_{\max}$ defined in (43) and (44) when $\Delta_2$ is small.



Lemma 10 (Improved Bounds for $\beta$ and $\lambda_{\max}$) Fix $\delta \in (0, 1)$. When $|S| \le 2\eta$ and $S \subset \mathcal{N}(a; G_\cup) \cup
\mathcal{N}(b; G_\cup)$, with probability at least $1 - \delta$,



I3{w)> 



Amax(w^) < 



\/2amin5 



eri'^jrp'^d'^^A 



2rj 
2 



Oti 



1 + A/21n(rV 



(62) 
(63) 



We can substitute the above result to obtain a better bound $K^{\text{tree}}(\delta; p, d, r)$ for learning tree
mixtures.



D.5 Analysis of Tree Approximations: Proof of Theorem [3] 

We now relate the perturbation of a probability vector to the perturbation of the corresponding mutual
information (Cover and Thomas, 2006). Recall that for discrete random variables $X$, $Y$, the mutual
information $I(X; Y)$ is related to their entropies $H(X, Y)$, $H(X)$ and $H(Y)$ as

$$I(X; Y) = H(X) + H(Y) - H(X, Y), \tag{64}$$

and the entropy is defined as

$$H(X) := -\sum_{x \in \mathcal{X}} P(X = x) \log P(X = x), \tag{65}$$

where $\mathcal{X}$ is the sample space of $X$. We recall the following result from (Shamir et al., 2008). Define the
function $\phi(x)$ for $x \in \mathbb{R}^+$ as

$$\phi(x) := \begin{cases} 0, & x = 0, \\ -x \log x, & x \in (0, 1/e), \\ 1/e, & \text{otherwise.} \end{cases} \tag{66}$$

Proposition 3 For any $a, b \in [0, 1]$,

$$|a \log a - b \log b| \le \phi(|a - b|), \tag{67}$$

for $\phi(\cdot)$ defined in (66).
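
The quantities in (64)-(67) are simple to compute, and Proposition 3 can be sanity-checked on a grid. The following is a minimal sketch, not part of the original proof; the function names, the grid resolution and the example joint distribution are our own choices, and natural logarithms are assumed.

```python
import math

def phi(x: float) -> float:
    """The function in (66): 0 at 0, -x log x on (0, 1/e), and 1/e otherwise."""
    if x == 0.0:
        return 0.0
    if x < 1.0 / math.e:
        return -x * math.log(x)
    return 1.0 / math.e

def xlogx(x: float) -> float:
    """x log x with the convention 0 log 0 = 0."""
    return 0.0 if x == 0.0 else x * math.log(x)

def entropy(p):
    """H(X) = -sum_x P(x) log P(x), as in (65)."""
    return -sum(xlogx(v) for v in p)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), as in (64); joint[i][j] = P(X=i, Y=j)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    pxy = [v for row in joint for v in row]
    return entropy(px) + entropy(py) - entropy(pxy)

# Grid check of Proposition 3 on [0, 1]^2 (small tolerance for rounding).
grid = [i / 200.0 for i in range(201)]
prop3_holds = all(
    abs(xlogx(a) - xlogx(b)) <= phi(abs(a - b)) + 1e-12
    for a in grid for b in grid
)
print(prop3_holds, mutual_information([[0.4, 0.1], [0.1, 0.4]]))
```

A grid check is of course not a proof; it only illustrates how $\phi$ controls the change in each $x \log x$ term when a probability is perturbed.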



We can thus prove bounds on the estimated mutual information $\hat I^{\text{spect}}(\cdot)$ using the statistics $\hat P^{\text{spect}}(\cdot)$
obtained from spectral decomposition.

Proposition 4 (Bounding $|\hat I^{\text{spect}}(\cdot) - I(\cdot)|$) Under the event that $\|\hat P^{\text{spect}}(Y_a, Y_b | H = h) - P(Y_a, Y_b | H = h)\|_2 \le \epsilon$,
we have that

$$|\hat I^{\text{spect}}(Y_a; Y_b | H = h) - I(Y_a; Y_b | H = h)| \le 3d\,\phi(\epsilon). \tag{68}$$



For the success of the Chow-Liu algorithm, it is easy to see that the algorithm finds the correct tree
when the estimated mutual information quantities are within half the minimum separation $\vartheta$ defined
in (20). This is because the only wrong edges in the estimated tree $\hat T_h$ are those that replace a
certain edge in the original tree $T_h$, without violating the tree constraint. Similar ideas have been
used by Tan et al. (2011) for deriving error exponent bounds for the Chow-Liu algorithm. Define

$$\epsilon^{\text{tree}} := \phi^{-1}\left(\frac{0.5\vartheta - \tau}{3d}\right). \tag{69}$$

Thus, the above result together with assumption (A11) implies that we can estimate the mutual
information to the required accuracy to obtain the correct tree approximations.
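
The role of the half-separation threshold can be illustrated on a small example. The sketch below is our own construction: it builds the Chow-Liu tree as a maximum-weight spanning tree over pairwise mutual informations using Kruskal's algorithm (a standard way to compute it, chosen here for brevity), with hypothetical mutual information values on four nodes. It then perturbs every value adversarially by slightly less, and slightly more, than $\vartheta/2$.

```python
def chow_liu_tree(num_nodes, weights):
    """Maximum-weight spanning tree via Kruskal's algorithm.

    weights maps an edge (a, b) with a < b to a mutual-information value.
    Returns the set of tree edges.
    """
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    tree = set()
    for (a, b), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        ra, rb = find(a), find(b)
        if ra != rb:                          # adding (a, b) keeps the graph acyclic
            parent[ra] = rb
            tree.add((a, b))
    return tree

# Hypothetical exact mutual informations on 4 nodes; the tree is the path 0-1-2-3.
exact = {(0, 1): 0.50, (1, 2): 0.40, (2, 3): 0.45,
         (0, 2): 0.20, (1, 3): 0.18, (0, 3): 0.08}
true_tree = chow_liu_tree(4, exact)

# Separation: each non-edge is compared with the edges on its tree path.
# The smallest gap here is I(1,2) - I(0,2) = 0.20.
vartheta = 0.20

def perturb(scale):
    """Shrink every tree edge and inflate every non-edge by scale * vartheta."""
    return {e: (w - scale * vartheta if e in true_tree else w + scale * vartheta)
            for e, w in exact.items()}

stable = chow_liu_tree(4, perturb(0.49)) == true_tree   # error just below vartheta / 2
broken = chow_liu_tree(4, perturb(0.51)) == true_tree   # error just above vartheta / 2
print(stable, broken)
```

In this example the tree survives the first perturbation and changes under the second, because the non-edge (0, 2) overtakes the tree edge (1, 2) once the errors exceed half the gap.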

E Analysis Under Local Separation Criterion 

E.l Rank Tests Under Approximate Separation 

We now extend the results of the previous section to the case where approximate separators are employed in
contrast to exact vertex separators. Let $S := S_{\text{local}}(u, v; G, \gamma)$ denote a local vertex separator
between any non-neighboring nodes $u$ and $v$ in graph $G$ under threshold $\gamma$. We note the following
result on the probability matrix $M_{u,v,\{S;k\}}$ defined earlier.

Lemma 11 (Rank Upon Approximate Separation) Given an $r$-mixture of graphical models
with $G = \cup_{h=1}^{r} G_h$, for any nodes $u, v \in V$ such that $\mathcal{N}[u] \cap \mathcal{N}[v] = \emptyset$, and with $S := S_{\text{local}}(u, v; G, \gamma)$
any separator of $u$ and $v$ on $G$, the probability matrix $M_{u,v,\{S;k\}} := [P[Y_u = i, Y_v = j, Y_S = k]]_{i,j}$
has effective rank at most $r$ for any $k \in \mathcal{Y}^{|S|}$:

$$\mathrm{Rank}\big(M_{u,v,\{S;k\}}; \zeta(\gamma)\big) \le r, \quad \forall k \in \mathcal{Y}^{|S|},\ (u, v) \notin G, \tag{70}$$

where $\zeta(\gamma) := 2\sqrt{d}\, \max_{h \in [r]} \zeta_h(\gamma)$, $\zeta_h(\cdot)$ is the correlation decay rate function in (23)
corresponding to the model $P(\mathbf{y}|H = h)$, and $\gamma$ is the path threshold for local vertex separators.

Notation: For convenience, for any node $v \in V$, let $P(Y_v | H = h) := P(Y_v | H = h; G_h)$ denote the
original component model Markov on graph $G_h$, and let $P(Y_v)$ denote the corresponding marginal
distribution of $Y_v$ in the mixture. Let $P^\gamma(Y_v | H = h) := P(Y_v | H = h; F_{\gamma,h})$ denote the component
model Markov on the induced subgraph $F_{\gamma,h} := G_h(B_\gamma(v))$, where $B_\gamma(v; G_h)$ is the $\gamma$-neighborhood
of node $v$ in $G_h$. In other words, we limit the model parameters to the $\gamma$-neighborhood and remove
the rest of the edges to obtain $P^\gamma(Y_v | H = h)$.

Proof: We first claim that

$$\|M_{u|v,\{S;k\}} - M_{u|H,\{S;k\}} M_{H|v,\{S;k\}}\|_2 \le \zeta(\gamma). \tag{71}$$

Note the relationship between the joint and the conditional probability matrices:

$$M_{u,v,\{S;k\}} = M_{u|v,\{S;k\}}\, \mathrm{Diag}(\pi_{v,\{S;k\}}), \tag{72}$$

where $\pi_{v,\{S;k\}} := [P(Y_v = i, Y_S = k)]_i^\top$ is the probability vector and $\mathrm{Diag}(\cdot)$ is the diagonal matrix
with the corresponding probability vector as the diagonal elements. Assuming (71) holds and
applying (72), we have that

$$\|M_{u,v,\{S;k\}} - M_{u|H,\{S;k\}} M_{H|v,\{S;k\}}\, \mathrm{Diag}(\pi_{v,\{S;k\}})\|_2 \le \|\mathrm{Diag}(\pi_{v,\{S;k\}})\|_2\, \zeta(\gamma) \le \zeta(\gamma), \tag{73}$$

since $\|\mathrm{Diag}(\pi_{v,\{S;k\}})\|_2 \le \|\mathrm{Diag}(\pi_{v,\{S;k\}})\|_F = \|\pi_{v,\{S;k\}}\|_2 \le 1$ for a probability vector. From
Weyl's theorem, assuming that (73) holds, we have

$$\mathrm{Rank}\big(M_{u,v,\{S;k\}}; \zeta(\gamma)\big) \le \min(r, d) = r,$$

since we assume $r \le d$ (assumption (B1) in Section B.3). Note that $\mathrm{Rank}(A; \xi)$ denotes the effective
rank, i.e., the number of singular values of $A$ which are greater than $\xi \ge 0$.
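
To make the notion of effective rank concrete, consider the smallest non-trivial case. The sketch below is our own illustration with $r = 1$ and $d = 2$: an exactly rank-one probability matrix is perturbed by a small matrix $E$ (standing in for the correlation-decay error in (73)), and the closed-form singular values of a $2 \times 2$ matrix are used. The matrices and the helper names are hypothetical.

```python
import math

def singular_values_2x2(A):
    """Singular values of a 2x2 matrix, largest first (closed form)."""
    (a, b), (c, d) = A
    s = a * a + b * b + c * c + d * d          # squared Frobenius norm
    det = a * d - b * c
    disc = math.sqrt(max(s * s - 4.0 * det * det, 0.0))
    return [math.sqrt((s + disc) / 2.0), math.sqrt(max((s - disc) / 2.0, 0.0))]

def effective_rank(A, xi):
    """Rank(A; xi): number of singular values of A strictly greater than xi."""
    return sum(1 for sv in singular_values_2x2(A) if sv > xi)

# Hypothetical example: a rank-one joint distribution plus a small perturbation E
# whose entries sum to zero, so that M is still a probability matrix.
low_rank = [[0.30 * 0.5, 0.30 * 0.5], [0.70 * 0.5, 0.70 * 0.5]]
E = [[0.004, -0.004], [-0.004, 0.004]]
M = [[low_rank[i][j] + E[i][j] for j in range(2)] for i in range(2)]

zeta = singular_values_2x2(E)[0]               # spectral norm of E
print(effective_rank(M, zeta), effective_rank(M, 0.0))
```

By Weyl's theorem the second singular value of `M` is at most the spectral norm of `E`, so thresholding at that level recovers rank one, whereas the exact rank of `M` is two.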

We now prove the claim in (71). Since $G = \cup_{h=1}^{r} G_h$, we have that the resulting set $S :=
S_{\text{local}}(u, v; G, \gamma)$ is also a local separator on each of the component subgraphs $\{G_h\}_{h \in [r]}$ of $G$, for
all nodes $u, v \in V$ such that $\mathcal{N}[u; G] \cap \mathcal{N}[v; G] = \emptyset$. Thus, we have that for all $k \in \mathcal{Y}^{|S|}$, $y_v \in \mathcal{Y}$,
$h \in [r]$,

$$P^\gamma(Y_u | Y_v = y_v, Y_S = k, H = h) = P^\gamma(Y_u | Y_S = k, H = h). \tag{74}$$

The statement in (74) is due to the fact that the nodes $u$ and $v$ are exactly separated by set $S$ in
the subgraph $F_{\gamma,h}(u)$.






By assumption (B4) on correlation decay, we have that

$$\|P(Y_u | Y_v = y_v, Y_S = k, H = h) - P^\gamma(Y_u | Y_v = y_v, Y_S = k, H = h)\|_1 \le \zeta_h(\gamma),$$

for all $y_v \in \mathcal{Y}$, $k \in \mathcal{Y}^{|S|}$ and $h \in [r]$. Similarly, we also have

$$\|P(Y_u | Y_S = k, H = h) - P^\gamma(Y_u | Y_S = k, H = h)\|_1 \le \zeta_h(\gamma),$$

which implies that

$$\|P(Y_u | Y_v = y_v, Y_S = k, H = h) - P(Y_u | Y_S = k, H = h)\|_1 \le 2\zeta_h(\gamma),$$

for all $y_v \in \mathcal{Y}$, $k \in \mathcal{Y}^{|S|}$ and $h \in [r]$, and thus,

$$\|M_{u|v,\{S;k\}} - M_{u|H,\{S;k\}} M_{H|v,\{S;k\}}\|_1 \le 2 \max_{h \in [r]} \zeta_h(\gamma), \tag{75}$$

where $\|A\|_1$ of a matrix is the maximum column-wise absolute sum. Since $\|A\|_2 \le \sqrt{d}\, \|A\|_1$, (71)
follows. □

E.2 Spectral Decomposition Under Local Separation 

We now extend the above analysis of spectral decomposition to the case where a local separator is used instead
of an exact separator. For simplicity consider nodes $u_*, a, b, c \in V$ (the same results can also be
proven for larger sets), where $u_*$ is an isolated node in $G_\cup$, $a, b \in V \setminus \{u_*\}$, $c \notin \mathcal{N}[a; G_\cup] \cup \mathcal{N}[b; G_\cup]$,
and let $S := S_{\text{local}}((a, b), c; G_\cup)$ be a local separator in $G_\cup$ separating $a, b$ from $c$. Since we have

$$Y_{u_*} \perp Y_{V \setminus \{u_*\}} \mid H,$$

the following decomposition holds:

$$M_{u_*,c,\{S;k\}} = M_{u_*|H}\, \mathrm{Diag}(\pi_{H,\{S;k\}})\, M_{c|H,\{S;k\}}^\top.$$

However, the matrix $M_{u_*,c,\{S;k\},\{(a,b);q\}}$ no longer has a similar decomposition. Instead define

$$\tilde M_{u_*,c,\{S;k\},\{(a,b);q\}} := M_{u_*|H}\, \mathrm{Diag}(\pi_{H,\{S;k\},\{(a,b);q\}})\, M_{c|H,\{S;k\}}^\top. \tag{76}$$

Define the observable operator, on lines of (38), based on $\tilde M$ above rather than the actual probability
matrix $M$, as

$$\tilde C(m) := \Big(U_1^\top \Big(\sum_q m(q)\, \tilde M_{u_*,c,\{S;k\},\{(a,b);q\}}\Big) U_2\Big) \big(U_1^\top M_{u_*,c,\{S;k\}} U_2\big)^{-1}, \tag{77}$$

where $U_1$ is a matrix such that $U_1^\top M_{u_*|H}$ is invertible and $U_2$ is such that $U_2^\top M_{c|H,\{S;k\}}$ is invertible.
On lines of Lemma 4, we have that

$$\tilde C(m) = \big(U_1^\top M_{u_*|H}\big)\, \mathrm{Diag}\big(M_{(a,b)|H,\{S;k\}}^\top m\big)\, \big(U_1^\top M_{u_*|H}\big)^{-1}. \tag{78}$$

Thus, the $r$ roots of the polynomial $\lambda \mapsto \det(\tilde C(m) - \lambda I)$ are $\{\langle m, M_{(a,b)|H,\{S;k\}} e_j \rangle : j \in [r]\}$. We
now show that $\tilde M$ and $M$ are close under correlation decay.

Proposition 5 (Regime of Correlation Decay) For all $k \in \mathcal{Y}^{|S|}$ and $q \in \mathcal{Y}^2$, we have

$$\|\tilde M_{u_*,c,\{S;k\},\{(a,b);q\}} - M_{u_*,c,\{S;k\},\{(a,b);q\}}\|_2 \le \zeta(\gamma), \tag{79}$$

where $\zeta(\gamma)$ is given by (26).

Proof: On lines of obtaining (75) in the proof of Lemma 11, it is easy to see that

$$\Big\|P(Y_c | Y_S = k, Y_{a,b} = q) - \sum_{h \in [r]} P(Y_c | Y_S = k, H = h)\, P(H = h | Y_S = k, Y_{a,b} = q)\Big\|_1 \le 2 \max_{h \in [r]} \zeta_h(\gamma).$$

This implies that for all $y \in \mathcal{Y}$,

$$\Big\|\sum_{h \in [r]} P(Y_{u_*} = y | H = h)\, P(H = h, Y_S = k, Y_{a,b} = q)\, P(Y_c | Y_S = k, H = h) - P(Y_c, Y_{u_*} = y, Y_S = k, Y_{a,b} = q)\Big\|_1 \le 2 \max_{h \in [r]} \zeta_h(\gamma). \tag{80}$$

This is the same as

$$\|\tilde M_{u_*,c,\{S;k\},\{(a,b);q\}} - M_{u_*,c,\{S;k\},\{(a,b);q\}}\|_\infty \le 2 \max_{h \in [r]} \zeta_h(\gamma), \tag{81}$$

where $\|A\|_\infty$ is the maximum absolute row sum. Since $\|A\|_2 \le \sqrt{d}\, \|A\|_\infty$ for a $d \times d$ matrix,
we have the result. □

E.3 Spectral Bounds under Local Separation 

The result follows on similar lines as Section D.3, except that the distortion between the sample
version of the observable operator $\hat C(m)$ and the desired version $\tilde C(m)$ changes. This leads to a
slightly different bound.

Lemma 12 (Bounds for $\|\hat M_{a,b|H,\{S;k\}} e_j - M_{a,b|H,\{S;k\}} e_{\tau(j)}\|_2$) For any $a, b \in V \setminus \{u_*\}$, $k \in \mathcal{Y}^{|S|}$,
$j \in [r]$, there exists a permutation $\tau(j) \in [r]$ such that, conditioned on the event that $\hat G_\cup = G_\cup$, with
probability at least $1 - 3\delta$,

$$\|\hat M_{a,b|H,\{S;k\}} e_j - M_{a,b|H,\{S;k\}} e_{\tau(j)}\|_2 \le \frac{K(\delta; p, d, r)}{\sqrt{n}} + K'(\delta; p, d, r)\, \zeta(\gamma), \tag{82}$$

where $K'$ and $K$ are given by (57) and (58), and $\zeta(\gamma)$ is given by (26). This implies

$$\|\hat M_{a,b|H} e_j - M_{a,b|H} e_{\tau(j)}\|_2 \le \frac{2K(\delta; p, d, r)}{\sqrt{n}} + 2K'(\delta; p, d, r)\, \zeta(\gamma). \tag{83}$$
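
The bound (83) has a sampling term that vanishes with $n$ and a floor $2K'\zeta(\gamma)$ that does not, which is why a target distortion $\epsilon$ is only attainable above that floor. The following minimal sketch solves (83) for the number of samples. The function names and the numerical constants are hypothetical and chosen only to exercise the formulas; in the analysis, $K$ and $K'$ are the quantities defined in (57) and (58).

```python
import math

def eps_floor(K_prime: float, zeta: float) -> float:
    """Distortion floor 2 K' zeta(gamma), the part of (83) that does not decay with n."""
    return 2.0 * K_prime * zeta

def distortion_bound(n: int, K: float, K_prime: float, zeta: float) -> float:
    """Right-hand side of (83): 2K / sqrt(n) + 2K' zeta(gamma)."""
    return 2.0 * K / math.sqrt(n) + eps_floor(K_prime, zeta)

def samples_for(eps: float, K: float, K_prime: float, zeta: float) -> int:
    """Smallest n with distortion_bound(n, ...) <= eps; requires eps above the floor."""
    gap = eps - eps_floor(K_prime, zeta)
    if gap <= 0.0:
        raise ValueError("target eps must exceed the floor 2 K' zeta")
    return math.ceil((2.0 * K / gap) ** 2)

# Hypothetical constants, chosen only to exercise the formulas.
K, K_prime, zeta = 50.0, 20.0, 1e-3
n = samples_for(0.1, K, K_prime, zeta)
print(n, distortion_bound(n, K, K_prime, zeta))
```

As the target $\epsilon$ approaches the floor, the required $n$ grows like $(\epsilon - 2K'\zeta(\gamma))^{-2}$.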



F Matrix perturbation analysis 

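We borrow the following results on matrix perturbation bounds from (Anandkumar et al., 2012b).

We denote the $p$-norm of a vector $v$ by $\|v\|_p$, and the corresponding induced norm of a matrix $A$ by
$\|A\|_p := \sup_{v \ne 0} \|Av\|_p / \|v\|_p$. The Frobenius norm of a matrix $A$ is denoted by $\|A\|_F$. For a matrix
$A \in \mathbb{R}^{m \times n}$, let $\kappa(A) := \sigma_1(A)/\sigma_{\min(m,n)}(A)$ (thus $\kappa(A) = \|A\|_2 \cdot \|A^{-1}\|_2$ if $A$ is invertible).

Lemma 13 Let $X \in \mathbb{R}^{m \times n}$ be a matrix of rank $k$. Let $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ be matrices
with orthonormal columns such that $\mathrm{range}(U)$ and $\mathrm{range}(V)$ are spanned by, respectively, the left
and right singular vectors of $X$ corresponding to its $k$ largest singular values. Similarly define
$\hat U \in \mathbb{R}^{m \times k}$ and $\hat V \in \mathbb{R}^{n \times k}$ relative to a matrix $\hat X \in \mathbb{R}^{m \times n}$. Define $\epsilon_X := \|\hat X - X\|_2$, $\epsilon_0 := \frac{\epsilon_X}{\sigma_k(X)}$,
and $\epsilon_1 := \frac{\epsilon_0}{1 - \epsilon_0}$. Assume $\epsilon_0 < \frac{1}{2}$. Then

1. $\epsilon_1 < 1$;

2. $\sigma_k(\hat X) = \sigma_k(\hat U^\top \hat X \hat V) \ge (1 - \epsilon_0) \cdot \sigma_k(X) > 0$;

3. $\sigma_k(\hat U^\top U) \ge \sqrt{1 - \epsilon_1^2}$;

4. $\sigma_k(\hat V^\top V) \ge \sqrt{1 - \epsilon_1^2}$;

5. $\sigma_k(\hat U^\top X \hat V) \ge (1 - \epsilon_1^2) \cdot \sigma_k(X)$;

6. for any $\hat\alpha \in \mathbb{R}^k$ and $v \in \mathrm{range}(U)$, $\|\hat U \hat\alpha - v\|_2^2 \le \|\hat\alpha - \hat U^\top v\|_2^2 + \|v\|_2^2 \cdot \epsilon_1^2$.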

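Lemma 14 Consider the setting and definitions from Lemma 13, and let $Y \in \mathbb{R}^{m \times n}$ and $\hat Y \in \mathbb{R}^{m \times n}$
be given. Define $\epsilon_2 := \frac{\epsilon_0}{(1 - \epsilon_1^2) \cdot (1 - \epsilon_0 - \epsilon_1^2)}$ and $\epsilon_Y := \|\hat Y - Y\|_2$. Assume $\epsilon_0 < \frac{1}{1 + \sqrt{2}}$. Then

1. $\hat U^\top \hat X \hat V$ and $\hat U^\top X \hat V$ are both invertible, and $\|(\hat U^\top \hat X \hat V)^{-1} - (\hat U^\top X \hat V)^{-1}\|_2 \le \frac{\epsilon_2}{\sigma_k(X)}$;

2. $\|(\hat U^\top \hat Y \hat V)(\hat U^\top \hat X \hat V)^{-1} - (\hat U^\top Y \hat V)(\hat U^\top X \hat V)^{-1}\|_2 \le \frac{\epsilon_Y}{(1 - \epsilon_0) \cdot \sigma_k(X)} + \frac{\|Y\|_2 \cdot \epsilon_2}{\sigma_k(X)}$.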

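Lemma 15 Let $A \in \mathbb{R}^{k \times k}$ be a diagonalizable matrix with $k$ distinct real eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_k \in
\mathbb{R}$ corresponding to the (right) eigenvectors $\xi_1, \xi_2, \ldots, \xi_k \in \mathbb{R}^k$, all normalized to have $\|\xi_i\|_2 = 1$. Let
$R \in \mathbb{R}^{k \times k}$ be the matrix whose $i$-th column is $\xi_i$. Let $\hat A \in \mathbb{R}^{k \times k}$ be a matrix. Define $\epsilon_A := \|\hat A - A\|_2$,
$\gamma_A := \min_{i \ne j} |\lambda_i - \lambda_j|$, and $\epsilon_3 := \frac{\kappa(R) \cdot \epsilon_A}{\gamma_A}$. Assume $\epsilon_3 < \frac{1}{2}$. Then there exists a permutation $\tau$ on
$[k]$ such that the following holds:

1. $\hat A$ has $k$ distinct real eigenvalues $\hat\lambda_1, \hat\lambda_2, \ldots, \hat\lambda_k \in \mathbb{R}$, and $|\hat\lambda_{\tau(i)} - \lambda_i| \le \epsilon_3 \cdot \gamma_A$ for all $i \in [k]$;

2. $\hat A$ has corresponding (right) eigenvectors $\hat\xi_1, \hat\xi_2, \ldots, \hat\xi_k \in \mathbb{R}^k$, normalized to have $\|\hat\xi_i\|_2 = 1$,
which satisfy $\|\hat\xi_{\tau(i)} - \xi_i\|_2 \le 4(k - 1) \cdot \|R^{-1}\|_2 \cdot \epsilon_3$ for all $i \in [k]$;

3. the matrix $\hat R \in \mathbb{R}^{k \times k}$ whose $i$-th column is $\hat\xi_{\tau(i)}$ satisfies $\|\hat R - R\|_2 \le \|\hat R - R\|_F \le 4k^{1/2}(k - 1) \cdot
\|R^{-1}\|_2 \cdot \epsilon_3$.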

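Lemma 16 Let $A_1, A_2, \ldots, A_k \in \mathbb{R}^{k \times k}$ be diagonalizable matrices that are diagonalized by the same
invertible matrix $R \in \mathbb{R}^{k \times k}$ with unit length columns $\|R e_j\|_2 = 1$, such that each $A_i$ has $k$ distinct
real eigenvalues:

$$R^{-1} A_i R = \mathrm{Diag}(\lambda_{i,1}, \lambda_{i,2}, \ldots, \lambda_{i,k}).$$

Let $\hat A_1, \hat A_2, \ldots, \hat A_k \in \mathbb{R}^{k \times k}$ be given. Define $\epsilon_A := \max_i \|\hat A_i - A_i\|_2$, $\gamma_A := \min_i \min_{j \ne j'} |\lambda_{i,j} - \lambda_{i,j'}|$,
$\lambda_{\max} := \max_{i,j} |\lambda_{i,j}|$, $\epsilon_3 := \frac{\kappa(R) \cdot \epsilon_A}{\gamma_A}$, and $\epsilon_4 := 4k^{1.5} \cdot \|R^{-1}\|_2^2 \cdot \epsilon_3$. Assume $\epsilon_3 < \frac{1}{2}$ and $\epsilon_4 < 1$. Then
there exists a permutation $\tau$ on $[k]$ such that the following holds.

1. The matrix $\hat A_1$ has $k$ distinct real eigenvalues $\hat\lambda_{1,1}, \hat\lambda_{1,2}, \ldots, \hat\lambda_{1,k} \in \mathbb{R}$, and $|\hat\lambda_{1,j} - \lambda_{1,\tau(j)}| \le
\epsilon_3 \cdot \gamma_A$ for all $j \in [k]$.

2. There exists a matrix $\hat R \in \mathbb{R}^{k \times k}$ whose $j$-th column is a right eigenvector corresponding to $\hat\lambda_{1,j}$,
scaled so $\|\hat R e_j\|_2 = 1$ for all $j \in [k]$, such that $\|\hat R - R_\tau\|_2 \le \frac{\epsilon_4}{\|R^{-1}\|_2}$, where $R_\tau$ is the matrix
obtained by permuting the columns of $R$ with $\tau$.

3. The matrix $\hat R$ is invertible and its inverse satisfies $\|\hat R^{-1} - R_\tau^{-1}\|_2 \le \|R^{-1}\|_2 \cdot \frac{\epsilon_4}{1 - \epsilon_4}$.

4. For all $i \in \{2, 3, \ldots, k\}$ and all $j \in [k]$, the $(j, j)$-th element of $\hat R^{-1} \hat A_i \hat R$, denoted by $\hat\lambda_{i,j} :=
e_j^\top \hat R^{-1} \hat A_i \hat R e_j$, satisfies

$$|\hat\lambda_{i,j} - \lambda_{i,\tau(j)}| \le \Big(1 + \frac{\epsilon_4}{1 - \epsilon_4}\Big) \cdot \Big(1 + \frac{\epsilon_4}{\sqrt{k} \cdot \kappa(R)}\Big) \cdot \epsilon_3 \cdot \gamma_A$$
$$\qquad + \kappa(R) \cdot \Big(\frac{1}{1 - \epsilon_4} + \frac{1}{\sqrt{k} \cdot \kappa(R)} + \frac{1}{\sqrt{k}} \cdot \frac{\epsilon_4}{1 - \epsilon_4}\Big) \cdot \epsilon_4 \cdot \lambda_{\max}.$$

If $\epsilon_4 \le \frac{1}{2}$, then $|\hat\lambda_{i,j} - \lambda_{i,\tau(j)}| \le 3\epsilon_3 \cdot \gamma_A + 4\kappa(R) \cdot \epsilon_4 \cdot \lambda_{\max}$.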

Lemma 17 Let $V \in \mathbb{R}^{k \times k}$ be an invertible matrix, and let $R \in \mathbb{R}^{k \times k}$ be the matrix whose $j$-th
column is $V e_j / \|V e_j\|_2$. Then $\|R\|_2 \le \kappa(V)$, $\|R^{-1}\|_2 \le \kappa(V)$, and $\kappa(R) \le \kappa(V)^2$.
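
The three inequalities of Lemma 17 can be checked numerically in the $2 \times 2$ case, where singular values have a closed form. The sketch below is our own illustration; the matrix `V` and all helper names are hypothetical.

```python
import math

def singular_values_2x2(A):
    """Singular values of a 2x2 matrix, largest first (closed form)."""
    (a, b), (c, d) = A
    s = a * a + b * b + c * c + d * d
    det = a * d - b * c
    disc = math.sqrt(max(s * s - 4.0 * det * det, 0.0))
    return [math.sqrt((s + disc) / 2.0), math.sqrt(max((s - disc) / 2.0, 0.0))]

def inverse_2x2(A):
    """Inverse of an invertible 2x2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def normalize_columns(V):
    """R whose j-th column is V e_j / ||V e_j||_2."""
    norms = [math.hypot(V[0][j], V[1][j]) for j in range(2)]
    return [[V[i][j] / norms[j] for j in range(2)] for i in range(2)]

def kappa(A):
    """Condition number sigma_1(A) / sigma_2(A)."""
    top, bottom = singular_values_2x2(A)
    return top / bottom

V = [[3.0, 1.0], [1.0, 0.5]]          # hypothetical invertible matrix
R = normalize_columns(V)

norm_R = singular_values_2x2(R)[0]
norm_R_inv = singular_values_2x2(inverse_2x2(R))[0]
print(norm_R <= kappa(V), norm_R_inv <= kappa(V), kappa(R) <= kappa(V) ** 2)
```

For this example the bounds are far from tight, which is typical when the columns of $V$ have comparable norms.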
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